Search Results: "qc"

12 April 2022

Sven Hoexter: Emulating Raspi2 like hardware with RaspiOS in 2022

Update of my notes from 2020.
# Download a binary device tree file and matching kernel a good soul uploaded to github
wget https://github.com/vfdev-5/qemu-rpi2-vexpress/raw/master/kernel-qemu-4.4.1-vexpress
wget https://github.com/vfdev-5/qemu-rpi2-vexpress/raw/master/vexpress-v2p-ca15-tc1.dtb
# Download the official Raspberry Pi OS Lite image (no X)
wget https://downloads.raspberrypi.org/raspios_lite_armhf/images/raspios_lite_armhf-2022-04-07/2022-04-04-raspios-bullseye-armhf-lite.img.xz
unxz 2022-04-04-raspios-bullseye-armhf-lite.img.xz
# Convert it from the raw image to a qcow2 image and add some space
qemu-img convert -f raw -O qcow2 2022-04-04-raspios-bullseye-armhf-lite.img rasbian.qcow2
qemu-img resize rasbian.qcow2 4G
# make sure we get a user account setup
echo "me:$(echo 'test123' | openssl passwd -6 -stdin)" > userconf
sudo guestmount -a rasbian.qcow2 -m /dev/sda1 /mnt
sudo mv userconf /mnt
sudo guestunmount /mnt
# start qemu
qemu-system-arm -m 2048M -M vexpress-a15 -cpu cortex-a15 \
 -kernel kernel-qemu-4.4.1-vexpress -no-reboot \
 -smp 2 -serial stdio \
 -dtb vexpress-v2p-ca15-tc1.dtb -sd rasbian.qcow2 \
 -append "root=/dev/mmcblk0p2 rw rootfstype=ext4 console=ttyAMA0,115200 loglevel=8" \
 -nic user,hostfwd=tcp::5555-:22
# login at the serial console as user me with password test123
sudo -i
# enable ssh
systemctl enable ssh
systemctl start ssh
# resize partition and filesystem
parted /dev/mmcblk0 resizepart 2 100%
resize2fs /dev/mmcblk0p2
Now I can log in via ssh and start to play:
ssh me@localhost -p 5555
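The 4G used in the resize step above is a power of two, and as far as I understand newer QEMU releases reject SD card images whose size is not one. A minimal sketch for picking a size, assuming the raw image size is known in MiB; the helper name next_pow2_mib and the 1900 MiB figure are made up for illustration:

```shell
# Hypothetical helper: round a size in MiB up to the next power of two.
# Assumption: QEMU's SD card emulation only accepts power-of-two sizes.
next_pow2_mib() {
  size=$1
  pow=1
  while [ "$pow" -lt "$size" ]; do
    pow=$((pow * 2))
  done
  echo "$pow"
}

# A hypothetical 1900 MiB image would need at least a 2048 MiB card.
next_pow2_mib 1900
```

The result can be handed to qemu-img resize with an M suffix; picking the next larger power of two, as the notes do with 4G, leaves room to grow the root filesystem.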

18 February 2022

Dirk Eddelbuettel: RcppSimdJson 0.1.7 on CRAN: Maintenance

The RcppSimdJson package was updated to release 0.1.7 today. CRAN had sent a note overnight that the package triggered a LENGTH_1 error (where boolean comparisons happen with vectors longer than one element). That may be debatable in the two cases flagged if one looks at the commit, but life is too short to debate this, so we just fixed it. The email came in at 04:50h-ish when I was sound asleep, but four hours later the fixed version was on CRAN thanks to the automated processing.

RcppSimdJson wraps the fantastic and genuinely impressive simdjson library by Daniel Lemire and collaborators. Via very clever algorithmic engineering to obtain largely branch-free code, coupled with modern C++ and newer compiler instructions, it parses gigabytes of JSON per second, which is quite mindboggling. The best-case performance is "faster than CPU speed", as use of parallel SIMD instructions and careful branch avoidance can lead to less than one CPU cycle per byte parsed; see the video of the talk by Daniel Lemire at QCon (also voted best talk).

The very short NEWS entry for this release follows.

Changes in version 0.1.7 (2022-02-18)
  • Two URLs were updated in 'README.md', and Travis artifacts and badges have been removed (Dirk).
  • One unit test file was updated to not trigger a 'LENGTH_1' warning (Dirk closing #76).
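
The "less than one CPU cycle per byte" claim above is simple arithmetic: divide the clock frequency by the parsing throughput. A minimal sketch with made-up numbers (a 3 GHz core and 4 GB/s throughput are assumptions for illustration, not measurements from this post):

```shell
# cycles per byte = clock frequency (Hz) / throughput (bytes per second)
cycles_per_byte() {
  awk -v hz="$1" -v bps="$2" 'BEGIN { printf "%.2f\n", hz / bps }'
}

# Hypothetical example: a 3 GHz core parsing 4 GB/s
cycles_per_byte 3000000000 4000000000
```

Any throughput above the clock frequency, counted in bytes per second, gives a value below 1.0.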

Courtesy of my CRANberries, there is also a diffstat report for this release. For questions, suggestions, or issues please use the issue tracker at the GitHub repo. If you like this or other open-source work I do, you can now sponsor me at GitHub.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.

5 February 2022

Reproducible Builds: Reproducible Builds in January 2022

Welcome to the January 2022 report from the Reproducible Builds project. In our reports, we try to outline the most important things that have been happening in the past month. As ever, if you are interested in contributing to the project, please visit our Contribute page on our website.
An interesting blog post was published by Paragon Initiative Enterprises about Gossamer, a proposal for securing the PHP software supply-chain. Utilising code-signing and third-party attestations, Gossamer aims to mitigate the risks within the notorious PHP world via publishing attestations to a transparency log. Their post, titled Solving Open Source Supply Chain Security for the PHP Ecosystem goes into some detail regarding the design, scope and implementation of the system.
This month, the Linux Foundation announced SupplyChainSecurityCon, a conference focused on exploring the security threats affecting the software supply chain, sharing best practices and mitigation tactics. The conference is part of the Linux Foundation's Open Source Summit North America and will take place June 21st-24th 2022, both virtually and in Austin, Texas.

Debian
There was significant progress made in the Debian Linux distribution this month, including:

Other distributions
kpcyrd reported on Twitter about the release of version 0.2.0 of pacman-bintrans, an experiment with binary transparency for the Arch Linux package manager, pacman. This new version is now able to query rebuilderd to check if a package was independently reproduced.
In the world of openSUSE, Bernhard M. Wiedemann posted his monthly reproducible builds status report.

diffoscope
diffoscope is our in-depth and content-aware diff utility. Not only can it locate and diagnose reproducibility issues, it can provide human-readable diffs from many kinds of binary formats. This month, Chris Lamb prepared and uploaded versions 199, 200, 201 and 202 to Debian unstable (that were later backported to Debian bullseye-backports by Mattia Rizzolo), as well as made the following changes to the code itself:
  • New features:
    • First attempt at incremental output support with a timeout. Now passing, for example, --timeout=60 will mean that diffoscope will not recurse into any sub-archives after 60 seconds total execution time has elapsed. Note that this is not a fixed/strict timeout due to implementation issues.
    • Support both variants of odt2txt, including the one provided by the unoconv package.
  • Bug fixes:
    • Do not return with a UNIX exit code of 0 if we encounter a file whose human-readable metadata matches literal file contents.
    • Don't fail if comparing a nonexistent file with a .pyc file (and add test).
    • If the debian.deb822 module raises any exception on import, re-raise it as an ImportError. This should fix diffoscope on some Fedora systems.
    • Even if a Sphinx .inv inventory file is labelled "The remainder of this file is compressed using zlib", it might not actually be. In this case, don't traceback and simply return the original content.
  • Documentation:
    • Improve documentation for the new --timeout option due to a few misconceptions.
    • Drop reference in the manual page claiming the ability to compare non-existent files on the command-line. (This has not been possible since version 32 which was released in September 2015).
    • Update "X has been modified after NT_GNU_BUILD_ID has been applied" messages to, for example, not duplicate the full filename in the diffoscope output.
  • Codebase improvements:
    • Tidy some control flow.
    • Correct a "recompile" typo.
In addition, Alyssa Ross fixed the comparison of CBFS names that contain spaces, Sergei Trofimovich fixed whitespace for compatibility with version 21.12 of the Black source code reformatter and Zbigniew Jędrzejewski-Szmek fixed JSON detection with a new version of file.
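
The --timeout option described above could be wrapped as follows. This is a minimal sketch: the function name compare_with_timeout and the file names are made up for illustration, and only the --timeout flag itself comes from the notes above.

```shell
# Hypothetical wrapper: compare two artifacts, but stop recursing into
# nested archives once the time budget (seconds, default 60) is used up.
compare_with_timeout() {
  budget=${3:-60}
  diffoscope --timeout="$budget" "$1" "$2"
}

# Usage (file names are placeholders):
#   compare_with_timeout first-build.deb second-build.deb
#   compare_with_timeout first-build.deb second-build.deb 300
```

diffoscope exits non-zero when the inputs differ, so a caller should not treat a non-zero exit code as a crash of the wrapper.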

Testing framework
The Reproducible Builds project runs a significant testing framework at tests.reproducible-builds.org, to check packages and other artifacts for reproducibility. This month, the following changes were made:
  • Frédéric Pierret (fepitre):
    • Add Debian bookworm to package set creation.
  • Holger Levsen:
    • Install the po4a package where appropriate, as it is needed for the Reproducible Builds website job. In addition, also run the i18n.sh and contributors.sh scripts.
    • Correct some grammar in Debian live image build output.
    • Shell monitor improvements:
      • Only show the offline node section if there are offline nodes.
      • Colorise offline nodes.
      • Shrink screen usage.
    • Node health check improvements:
      • Detect if live package builds encounter incomplete snapshots.
      • Detect if a host is running with today's date (when it should be set artificially in the future).
    • Use the devscripts package from bullseye-backports on Debian nodes.
    • Use the Munin monitoring package from bullseye-backports on Debian nodes too.
    • Update New Year handling, needed to be able to detect real and fake dates.
    • Improve the error message of the script that powercycles the arm64 architecture nodes hosted by Codethink.
  • Mattia Rizzolo:
    • Use the new --timeout option added in diffoscope version 202.
  • Roland Clobus:
    • Update the build scripts now that the hooks for live builds are maintained upstream in the live-build repository.
    • Show info lines in Jenkins when reproducible hooks have been active.
    • Use unique folders for the artifacts from each live Debian version.
  • Vagrant Cascadian:
    • Switch the Debian armhf architecture nodes to use new proxy.
    • Misc. node maintenance.

Upstream patches
The Reproducible Builds project attempts to fix as many currently-unreproducible packages as possible. In January, we wrote a large number of such patches, including:

And finally
If you are interested in contributing to the Reproducible Builds project, please visit our Contribute page on our website. You can also get in touch with us via:

26 January 2022

Timo Jyrinki: Unboxing Dell XPS 13 - openSUSE Tumbleweed alongside preinstalled Ubuntu

A look at the 2021 model of Dell XPS 13 - available with Linux pre-installed
I received a new laptop for work - a Dell XPS 13. Dell has long been famous for offering certain models with pre-installed Linux as a supported option, and opting for those is nice for moving some euros/dollars from a certain PC desktop OS monopoly towards Linux desktop engineering costs. Notably Lenovo also offers Ubuntu and Fedora options on many models these days (like Carbon X1 and P15 Gen 2).
(Photos: black box; opened box; accessories and a leaflet about Linux support; laptop lifted from the box, closed; laptop with lid open; Ubuntu running; openSUSE running)
Obviously a smooth, ready-to-rock Ubuntu installation is nice for most people already, but I need openSUSE, so after checking everything is fine with Ubuntu, I continued to install openSUSE Tumbleweed as a dual boot option. As I'm a funny little tinkerer, I obviously went with some special things. I wanted:
  • Ubuntu to remain as the reference supported OS on a small(ish) partition, useful to compare to if trying out new development versions of software on openSUSE and finding oddities.
  • openSUSE as the OS consuming most of the space.
  • LUKS encryption for openSUSE without LVM.
  • ext4's new fancy fast_commit feature in use during filesystem creation.
As a result of all that, I ended up juggling back and forth between installation screens a couple of times (even more than shown below, and also because I forgot I wanted to use encryption the first time around).
First boots to pre-installed Ubuntu and installation of openSUSE Tumbleweed as the dual-boot option:
(if the embedded video is not shown, use a direct link)
Some notes from the openSUSE installation:
  • openSUSE installer's partition editor apparently does not support resizing or automatically installing side-by-side with another Linux distribution, so I did part of the setup completely on my own.
  • Installation package download hung a couple of times, and only passed when I entered a mirror manually. On my TW I've also noticed download problems recently, there might be a problem with some mirror I need to escalate.
  • The installer doesn't very clearly show the encryption status of the target installation - it took me a couple of attempts before I even noticed the small "encrypted" column and icon (well, very small, see below), which also did not spell out the device mapper name but only the main partition name. In the end it was going to do the right thing right away and use my pre-created encrypted target partition as I wanted, but it could be a better UX. Then again I was doing my very own tweaks anyway.
  • Let's not go into the details of why I'm so old-fashioned and use ext4 :)
  • openSUSE's installer does not work fine with a HiDPI screen. Funnily the tty consoles seem to be fine and with a big font.
  • At the end of the video I install the two GNOME extensions I can't live without, Dash to Dock and Sound Input & Output Device Chooser.

20 January 2022

Sven Hoexter: Running OpenWRT x86 in qemu

Sometimes it's nice for testing purposes to have the OpenWRT userland available locally. Since there is an x86 build available one can just run it within qemu.
wget https://downloads.openwrt.org/releases/21.02.1/targets/x86/64/openwrt-21.02.1-x86-64-generic-squashfs-combined.img.gz
gunzip openwrt-21.02.1-x86-64-generic-squashfs-combined.img.gz
qemu-img convert -f raw -O qcow2 openwrt-21.02.1-x86-64-generic-squashfs-combined.img openwrt-21.02.1.qcow2
qemu-img resize openwrt-21.02.1.qcow2 200M
qemu-system-x86_64 -M q35 \
  -drive file=openwrt-21.02.1.qcow2,id=d0,if=none,bus=0,unit=0 \
  -device ide-hd,drive=d0,bus=ide.0 -nic user,hostfwd=tcp::5556-:22
# you've to change the network configuration to retrieve an IP via
# dhcp for the lan bridge br-lan
vi /etc/config/network
  - change option proto 'static' to 'dhcp'
  - remove IP address and netmask setting
/etc/init.d/network restart
# now you should've an ip out of 10.0.2.0/24
ssh root@localhost -p 5556
# remember ICMP does not work but otherwise you should have
# IP networking available
opkg update
opkg install curl
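
The manual edit of /etc/config/network in vi can also be scripted. A minimal sketch, assuming the stock layout where the LAN settings live in a stanza headed config interface 'lan'; the function name lan_to_dhcp is made up for illustration, and the awk script has only been reasoned through for that layout:

```shell
# Hypothetical helper: read an OpenWRT network config on stdin and write a
# copy to stdout in which the 'lan' interface uses DHCP instead of a
# static address. Other stanzas are passed through unchanged.
lan_to_dhcp() {
  awk '
    # Track whether the current stanza is: config interface (quote)lan(quote)
    /^config / { in_lan = ($2 == "interface" && $3 == "\047lan\047") }
    # Switch the protocol of the lan stanza to dhcp
    in_lan && $1 == "option" && $2 == "proto" { print "\toption proto \047dhcp\047"; next }
    # Drop the static address settings of the lan stanza
    in_lan && $1 == "option" && ($2 == "ipaddr" || $2 == "netmask") { next }
    { print }
  '
}

# Usage inside the guest (then restart the network as shown above):
#   lan_to_dhcp < /etc/config/network > /tmp/network.new
#   mv /tmp/network.new /etc/config/network
```

Writing to a temporary file first avoids truncating the config while it is still being read.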

30 November 2021

Russell Coker: Links November 2021

The Guardian has an amusing article by Sophie Elmhirst about Libertarians buying a cruise ship to make a seasteading project off the coast of Panama [1]. It turns out that you need permits etc to do this and maintaining a ship is expensive. Also you wouldn't want to mine cryptocurrency in a ship cabin as most cabins are small and don't have enough airconditioning to remain pleasant if you dump 1kW or more into the air.

NPR has an interesting article about the reaction of the NRA to the Columbine shootings [2]. Seems that some NRA person isn't a total asshole and is sharing their private information, maybe they are dying and are worried about going to hell.

David Brin wrote an insightful blog post about the singleton hypothesis where he covers some of the evidence of autocratic societies failing [3]. I think he makes a convincing point about a single centralised government for human society not being viable. But something like the EU on a world wide scale could work well.

Ken Shirriff wrote an interesting blog post about reverse engineering the Yamaha DX7 synthesiser [4].

The New York Times has an interesting article about a Baboon troop that became less aggressive after the alpha males all died at once from tuberculosis [5]. They established a new more peaceful culture that has outlived the beta males who avoided tuberculosis.

The Guardian has an interesting article about how sequencing the genomes of the entire population can save healthcare costs while improving the health of the population [6]. This is something wealthy countries should offer for free to the world population. At a bit under $1000 per test that's only about $7 trillion to test everyone, and of course the price should drop significantly if there were billions of tests being done.

The Strategy Bridge has an interesting article about SciFi books that have useful portrayals of military strategy [7]. The co-author is Major General Mick Ryan of the Australian Army which is noteworthy as Major General is the second highest rank in use by the Australian Army at this time.

Vice has an interesting article about the co-evolution of penises and vaginas and how a lot of that evolution is based on avoiding impregnation from rape [8].

Cory Doctorow wrote an insightful Medium article about the way that governments could force interoperability through purchasing power [9].

Cory Doctorow wrote an insightful article for Locus Magazine about imagining life after capitalism and how capitalism might be replaced [10]. We need a Star Trek future!

Arstechnica has an informative article about new developments in the rowhammer category of security attacks on DRAM [11]. It seems that DDR4 with ECC is the best current mitigation technique and that DDR3 with ECC is harder to attack than non-ECC RAM. So the thing to do is use ECC on all workstations and avoid doing security critical things on laptops because they can't use ECC RAM.

8 November 2021

Michael Ablassmeier: Libvirt/KVM Backup on Debian Bullseye

The libvirt and qemu versions in Debian Bullseye support a new feature that allows for easier backup and recovery of virtual machines. Instead of using snapshots for backup operations, it's now possible to enable dirty bitmaps. Other hypervisors tend to call this "changed block tracking". Using the new "backup begin" approach, it's not only possible to create live full backups (without having to create a snapshot) but also to track the changes between so-called checkpoints, which is very useful for incremental backups.

Over the course of the last few months, I have been working on a simple backup and recovery utility called virtnbdbackup. It uses the pull-based approach in the libvirt API set and currently supports:

Preparing the virtual machines
The dirty bitmap feature is not enabled by default; users can enable it by adding a new capability to a virtual machine configuration:
 <domain type='kvm' id='1' xmlns:qemu='http://libvirt.org/schemas/domain/qemu/1.0'>
 [..]
 <qemu:capabilities>
   <qemu:add capability='incremental-backup'/>
 </qemu:capabilities>
 [..]
 </domain>
To finally enable the feature, power cycle the virtual machine once.

Creating backups
By default, virtnbdbackup saves the virtual machine config, disks and its logfiles to a given target directory. It's also possible to stream the output into an uncompressed zip archive. Taking a backup is as simple as:
# virtnbdbackup -d vm2 -l full -o /tmp/WEEKLY_BACKUP/
 [..]
[2021-11-08 15:43:46] INFO virtnbdbackup - main [MainThread]: Backup jobs finished, stopping backup task.
[2021-11-08 15:43:46] INFO virtnbdbackup - main [MainThread]: Finished
# tree /tmp/WEEKLY_BACKUP/
 /tmp/WEEKLY_BACKUP/
  backup.full.11082021154343.log
  checkpoints
    virtnbdbackup.0.xml
  sda.full.data
  vm2.cpt
  vmconfig.virtnbdbackup.0.xml
From that point on, it's possible to create incremental backups:
 # virtnbdbackup -d vm2 -l inc -o /tmp/WEEKLY_BACKUP/
 # tree /tmp/WEEKLY_BACKUP/
 /tmp/WEEKLY_BACKUP/
  backup.full.11082021154343.log
  backup.inc.11082021154538.log
  checkpoints
    virtnbdbackup.0.xml
    virtnbdbackup.1.xml
  sda.full.data
  sda.inc.virtnbdbackup.1.data
  vm2.cpt
  vmconfig.virtnbdbackup.0.xml
  vmconfig.virtnbdbackup.1.xml
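
The full-then-incremental sequence above can be scripted by checking whether the target directory already holds a full backup. A minimal sketch, assuming full backups always leave a file matching *.full.data as in the listings above; the function names backup_level and run_backup are made up for illustration, and only the -d, -l and -o options shown in this post are used:

```shell
# Hypothetical helper: print "inc" if the target directory already contains
# a full backup data file, otherwise print "full".
backup_level() {
  for f in "$1"/*.full.data; do
    if [ -e "$f" ]; then
      echo inc
      return 0
    fi
  done
  echo full
}

# Hypothetical wrapper: $1 = libvirt domain, $2 = target directory
run_backup() {
  virtnbdbackup -d "$1" -l "$(backup_level "$2")" -o "$2"
}

# Usage: run_backup vm2 /tmp/WEEKLY_BACKUP/
```

Starting a fresh directory each week, as the WEEKLY_BACKUP name suggests, would then yield one full backup followed by incrementals.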
Restoring backups
The virtnbdrestore utility can be used to reconstruct the backup sets into usable qcow images, like so:
 # virtnbdrestore -a restore -i /tmp/WEEKLY_BACKUP/ -o /tmp/VM_RESTORE
Using the --until option it's also possible to only reconstruct the images to a certain checkpoint, allowing for point-in-time recovery.

Restoring single files
Via virtnbdmap you can map full backups back into a usable block device, without having to reconstruct the complete backup image:
# virtnbdmap -f /tmp/WEEKLY_BACKUP/sda.full.data
 [..] INFO virtnbdmap - <module> [MainThread]: Done mapping backup image to [/dev/nbd0]
 [..] INFO virtnbdmap - <module> [MainThread]: Press CTRL+C to disconnect
# fdisk -l /dev/nbd0
Disk /dev/nbd0: 2 GiB, 2147483648 bytes, 4194304 sectors
From here, you can either mount the disk and recover single files, or boot from it via:
qemu-img create -b /dev/nbd0 -F raw -f qcow2 bootme.qcow2
qemu-system-x86_64 -enable-kvm -m 2000 -hda bootme.qcow2
Check out the README for the full feature set.

1 October 2021

Russell Coker: Getting Started With Kali

Kali is a Debian based distribution aimed at penetration testing. I haven't felt a need to use it in the past because Debian has packages for all the scanning tools I regularly use, and all the rest are free software that can be obtained separately. But I recently decided to try it.

Here's the URL to get Kali [1]. For a VM you can get VMWare or VirtualBox images, I chose VMWare as it's the most popular image format and also a much smaller download (2.7G vs 4G). For unknown reasons the torrent for it didn't work (might be a problem with my torrent client). The download link for it was extremely slow in Australia, so I downloaded it to a system in Germany and then copied it from there.

I don't want to use either VMWare or VirtualBox because I find KVM/Qemu sufficient to do everything I want and they are in the Main section of Debian, so I needed to convert the image files. Some of the documentation on converting image formats to use with QEMU/KVM says to use a program called kvm-img which doesn't seem to exist, I used qemu-img from the qemu-utils package in Debian/Bullseye. The man page qemu-img(1) doesn't list the types of output format supported by the -O option and the examples returned by a web search show using "-O qcow2". It turns out that the following command will convert the image to "raw" format which is the format I prefer. I use BTRFS for storing all my VM images and that does all the copy-on-write I need.
qemu-img convert Kali-Linux-2021.3-vmware-amd64.vmdk ../kali
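
The command above works because qemu-img convert writes raw output when no -O option is given and detects the input format on its own. A minimal sketch that spells out both formats instead; the function name to_raw is made up for illustration:

```shell
# Hypothetical helper: convert a VMWare disk image to a raw image, naming
# the input (-f) and output (-O) formats instead of relying on defaults.
# $1 = source .vmdk file, $2 = destination raw image
to_raw() {
  qemu-img convert -f vmdk -O raw "$1" "$2"
}

# Usage: to_raw Kali-Linux-2021.3-vmware-amd64.vmdk ../kali
```

Naming the formats makes the intent visible to the next reader and avoids surprises from format auto-detection.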
After converting it the file was 500M smaller than the VMWare files (10.2 vs 10.7G). Probably the Kali distribution file could be reduced in size by converting it to raw and then back to VMWare format. The Kali VMWare image is compressed with 7zip which has a good compression ratio, I waited almost 90 minutes for zstd to compress it with -19 and the result was 12% larger than the 7zip file.

VMWare apparently likes to use an emulated SCSI controller, I spent some time trying to get that going in KVM. Apparently recent versions of QEMU changed the way this works and therefore older web pages aren't helpful. Also allegedly the SCSI emulation is buggy and unreliable (but I didn't manage to get it going so can't be sure). It turns out that the VM is configured to work with the virtio interface, the initramfs.conf has the configuration option MODULES=most which makes it boot on all common configurations (good work by the initramfs-tools maintainers).

The image works well with the Spice display interface: the window for the VM works the same way as other windows on my desktop and doesn't capture the mouse cursor. I don't know if this level of Spice integration is in Debian now, last time I tested it didn't work that way.

I also downloaded Metasploitable [2] which is a VM image designed to be full of security flaws for testing the tools that are in Kali. Again it worked nicely after converting from VMWare to raw format. One thing to note about Metasploitable is that you must not make it available on the public Internet. My home network has NAT for IPv4 but all systems get public IPv6 addresses. It's usually nice that those things just work on VMs but not for this. So I added an iptables command to block IPv6 to /etc/rc.local.

Conclusion
Installing VMs for both these distributions was quite easy. Most of my time was spent downloading from a slow server, trying to get SCSI emulation working, working out how to convert image files, and testing different compression options. The time spent doing stuff once I knew what to do was very small. Kali has zsh as the default shell, it's quite nice. I've been happy with bash for decades, but I might end up trying zsh out on other machines.

8 September 2021

Dirk Eddelbuettel: RcppSimdJson 0.1.6 on CRAN: New Upstream 1.0.0 !!

The RcppSimdJson team is happy to share that a new version 0.1.6 arrived on CRAN earlier today. Its release coincides with release 1.0.0 of simdjson itself, which is included in this release too!

RcppSimdJson wraps the fantastic and genuinely impressive simdjson library by Daniel Lemire and collaborators. Via very clever algorithmic engineering to obtain largely branch-free code, coupled with modern C++ and newer compiler instructions, it parses gigabytes of JSON per second, which is quite mindboggling. The best-case performance is "faster than CPU speed", as use of parallel SIMD instructions and careful branch avoidance can lead to less than one CPU cycle per byte parsed; see the video of the talk by Daniel Lemire at QCon (also voted best talk).

This version brings the new upstream release, thanks to a comprehensive pull request by Daniel Lemire. The short NEWS entry follows.

Changes in version 0.1.6 (2021-09-07)
  • The C++17 dependency was stated more clearly in the DESCRIPTION file (Dirk)
  • The simdjson version was updated to release 1.0.0 (Daniel Lemire in #70)

We should point out that the package still has a dependency on C++17 even though simdjson no longer does. Some of our earlier wrapping code uses it; this could be changed. If you, dear reader, would like to work on this please get in touch. Courtesy of my CRANberries, there is also a diffstat report for this release. For questions, suggestions, or issues please use the issue tracker at the GitHub repo. If you like this or other open-source work I do, you can now sponsor me at GitHub.


5 September 2021

Reproducible Builds: Reproducible Builds in August 2021

Welcome to the latest report from the Reproducible Builds project. In this post, we round up the important things that happened in the world of reproducible builds in August 2021. As always, if you are interested in contributing to the project, please visit the Contribute page on our website.
There were a large number of talks related to reproducible builds at DebConf21 this year, the 21st annual conference of the Debian Linux distribution (full schedule):
PackagingCon (@PackagingCon) is a new conference for developers of package management software as well as their related communities and stakeholders. The virtual event, which is scheduled to take place on the 9th and 10th November 2021, has a mission to "bring different ecosystems together: from Python's pip to Rust's cargo to Julia's Pkg, from Debian apt over Nix to conda and mamba, and from vcpkg to Spack we hope to have many different approaches to package management at the conference". A number of people from the reproducible builds community are planning on attending this new conference, and some may even present. Tickets start at $20 USD.
As reported in our May report, the president of the United States signed an executive order outlining policies aimed to improve the cybersecurity in the US. The executive order comes after a number of highly-publicised security problems such as a ransomware attack that affected an oil pipeline between Texas and New York and the SolarWinds hack that affected a large number of US federal agencies. As a followup this month, however, a detailed fact sheet was released announcing a number of large-scale initiatives that will undoubtedly be related to software supply chain security and, as a result, reproducible builds.
Lastly, we ran another productive meeting on IRC in August (original announcement) which ran for just short of two hours. A full set of notes from the meeting is available.

Software development
kpcyrd announced an interesting new project this month called "I probably didn't backdoor this", which describes itself as:
a practical attempt at shipping a program and having reasonably solid evidence there's probably no backdoor. All source code is annotated and there are instructions explaining how to use reproducible builds to rebuild the artifacts distributed in this repository from source. The idea is shifting the burden of proof from "you need to prove there's a backdoor" to "we need to prove there's probably no backdoor". This repository is less about code (we're going to try to keep code at a minimum actually) and instead contains technical writing that explains why these controls are effective and how to verify them. You are very welcome to adopt the techniques used here in your projects.
As the project's README goes on to mention: "the techniques used to rebuild the binary artifacts are only possible because the builds for this project are reproducible". This was also announced on our mailing list this month in a thread titled "i-probably-didnt-backdoor-this: Reproducible Builds for upstreams". kpcyrd also wrote a detailed blog post about the problems surrounding Linux distributions (such as Alpine and Arch Linux) that distribute compiled Python bytecode in the form of .pyc files generated during the build process.
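
The verification step that such a project relies on boils down to rebuilding an artifact and checking that it is bit-for-bit identical to the distributed one. A minimal sketch of that comparison, assuming sha256sum from coreutils is available; the function name same_artifact and the file names are made up for illustration and are not taken from the project:

```shell
# Hypothetical helper: succeed only if both files have the same SHA-256
# digest. Returns 2 if either file cannot be read.
same_artifact() {
  a=$(sha256sum < "$1") || return 2
  b=$(sha256sum < "$2") || return 2
  [ "$a" = "$b" ]
}

# Usage (paths are placeholders):
#   same_artifact downloaded/tool rebuilt/tool && echo "reproduced"
```

When the digests differ, diffoscope is the natural next step to find out why.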

diffoscope
diffoscope is our in-depth and content-aware diff utility. Not only can it locate and diagnose reproducibility issues, it can provide human-readable diffs from many kinds of binary formats. This month, Chris Lamb made a number of changes, including releasing version 180, version 181 and version 182, as well as the following changes:
  • New features:
    • Add support for extracting the signing block from Android APKs.
    • If we specify a suffix for a temporary file or directory within the code, ensure it starts with an underscore (ie. "_") to make the generated filenames more human-readable.
    • Don't include short GCC lines that differ on a single prefix byte either. These are distracting, not very useful and are simply the strings(1) command's idea of the build ID, which is displayed elsewhere in the diff.
    • Don't include specific .debug-like lines in the ELF-related output, as it is invariably a duplicate of the debug ID that exists better in the readelf(1) differences for this file.
  • Bug fixes:
    • Add a special case to SquashFS image extraction to not fail if we aren't the superuser.
    • Only use java -jar /path/to/apksigner.jar if we have an apksigner.jar as newer versions of apksigner in Debian use a shell wrapper script which will be rejected if passed directly to the JVM.
    • Reduce the maximum line length for calculating Wagner-Fischer, improving the speed of output generation a lot.
    • Don't require apksigner in order to compare .apk files using apktool.
    • Update calls (and tests) for the new version of odt2txt.
  • Output improvements:
    • Mention in the output if the apksigner tool is missing.
    • Profile diffoscope.diff.linediff and specialize.
  • Logging improvements:
    • Format debug-level messages related to ELF sections using the diffoscope.utils.format_class.
    • Print the size of generated reports in the logs (if possible).
    • Include profiling information in --debug output if --profile is not set.
  • Codebase improvements:
    • Clarify a comment about the HUGE_TOOLS Python dictionary.
    • We can pass -f to apktool to avoid creating a strangely-named subdirectory.
    • Drop an unused File import.
    • Update the supported & minimum version of Black.
    • We don't use the logging variable in a specific place, so alias it to an underscore (ie. "_") instead.
    • Update some various copyright years.
    • Clarify a comment.
  • Test improvements:
    • Update a test to check specific contents of SquashFS listings, otherwise it fails depending on the test system's user ID to username passwd(5) mapping.
    • Assign "seen" and "expected" values to local variables to improve contextual information in failed tests.
    • Don't print an orphan newline when the source code formatting test passes.

In addition, Santiago Torres Arias added support for Squashfs version 4.5 [ ] and Felix C. Stegerman suggested a number of small improvements to the output of the new APK signing block [ ]. Lastly, Chris Lamb uploaded python-libarchive-c version 3.1-1 to Debian experimental for the new 3.x branch; python-libarchive-c is used by diffoscope.

Distribution work In Debian, 68 reviews of packages were added, 33 were updated and 10 were removed this month, adding to our knowledge about identified issues. Two new issue types have been identified too: nondeterministic_ordering_in_todo_items_collected_by_doxygen and kodi_package_captures_build_path_in_source_filename_hash. kpcyrd published another monthly report on their work on reproducible builds within the Alpine and Arch Linux distributions, specifically mentioning rebuilderd, one of the components powering reproducible.archlinux.org. The report also touches on binary transparency, an important component for supply chain security. The @GuixHPC account on Twitter posted an infographic on what fraction of GNU Guix packages are bit-for-bit reproducible. Finally, Bernhard M. Wiedemann posted his monthly reproducible builds status report for openSUSE.

Upstream patches The Reproducible Builds project detects, dissects and attempts to fix as many currently-unreproducible packages as possible. We endeavour to send all of our patches upstream where appropriate. This month, we wrote a large number of such patches. Elsewhere, it was discovered that when supporting various new language features and APIs for Android apps, the resulting APK files that are generated now vary wildly from build to build (example diffoscope output). Happily, it appears that a patch has been committed to the relevant source tree. This was also discussed on our mailing list this month in a thread titled "Android desugaring and reproducible builds" started by Marcus Hoffmann.

Website and documentation There were quite a few changes to the Reproducible Builds website and documentation this month, including:
  • Felix C. Stegerman:
    • Update the website self-build process to not use the buster-backports suite now that Debian Bullseye is the stable release. [ ]
  • Holger Levsen:
    • Add a new page documenting various package rebuilder solutions. [ ]
    • Add some historical talks and slides from DebConf20. [ ][ ]
    • Various improvements to the history page. [ ][ ][ ]
    • Rename the "Comparison protocol" documentation category to "Verification". [ ]
    • Update links to F-Droid documentation. [ ]
  • Ian Muchina:
    • Increase the font size of titles and de-emphasize event details on the talk page. [ ]
    • Rename the README file to README.md to improve the user experience when browsing the Git repository in a web browser. [ ]
  • Mattia Rizzolo:
    • Drop a position:fixed CSS statement that negatively affects the layout at some width settings. [ ]
    • Fix the sizing of the elements inside the side navigation bar. [ ]
    • Show gold level sponsors and above in the sidebar. [ ]
    • Updated the documentation within reprotest to mention how ldconfig conflicts with the kernel variation. [ ]
  • Roland Clobus:
    • Added a ticket number for the issue with the live Cinnamon image and diffoscope. [ ]

Testing framework The Reproducible Builds project runs a testing framework at tests.reproducible-builds.org, to check packages and other artifacts for reproducibility. This month, the following changes were made:
  • Holger Levsen:
    • Debian-related changes:
      • Make a large number of changes to support the new Debian bookworm release, including adding it to the dashboard [ ], start scheduling tests [ ], adding suitable Apache redirects [ ] etc. [ ][ ][ ][ ][ ]
      • Make the first build use LANG=C.UTF-8 to match the official Debian build servers. [ ]
      • Only test Debian Live images once a week. [ ]
      • Upgrade all nodes to use Debian Bullseye [ ] [ ]
      • Update README documentation for the Debian Bullseye release. [ ]
    • Other changes:
      • Only include rsync output if the $DEBUG variable is enabled. [ ]
      • Don't try to install mock, a tool used to build Fedora packages some time ago. [ ]
      • Drop an unused function. [ ]
      • Various documentation improvements. [ ][ ]
      • Improve the node health check to detect zombie jobs. [ ]
  • Jessica Clarke (FreeBSD-related changes):
    • Update the location and branch name for the main FreeBSD Git repository. [ ]
    • Correctly ignore the source tarball when comparing build results. [ ]
    • Drop an outdated version number from the documentation. [ ]
  • Mattia Rizzolo:
    • Block F-Droid jobs from running whilst the setup is running. [ ]
    • Enable debugging for the rsync job related to Debian Live images. [ ]
    • Pass BUILD_TAG and BUILD_URL environment for the Debian Live jobs. [ ]
    • Refactor the master_wrapper script to use a Bash array for the parameters. [ ]
    • Prefer YAML's safe_load() function over the unsafe variant. [ ]
    • Use the correct variable in the Apache config to match possible existing files on disk. [ ]
    • Stop issuing HTTP 301 redirects for things that are not actually permanent. [ ]
  • Roland Clobus (Debian live image generation):
    • Increase the diffoscope timeout from 120 to 240 minutes; the Cinnamon image should now be able to finish. [ ]
    • Use the new snapshot service. [ ]
    • Make a number of improvements to artifact handling, such as moving the artifacts to the Jenkins host [ ] and correctly cleaning them up at the right time. [ ][ ][ ]
    • Where possible, link to the Jenkins build URL that created the artifacts. [ ][ ]
    • Only allow one job to run at a time. [ ]
  • Vagrant Cascadian:
    • Temporarily disable armhf nodes for DebConf21. [ ][ ]

Lastly, if you are interested in contributing to the Reproducible Builds project, please visit the Contribute page on our website.

2 September 2021

Norbert Preining: Reducing (sparsifying) qcow2 image of Windows10

Since joining Fujitsu I am permanently running a VM (kvm/qemu) with Windows 10 (unfortunately necessary). While the used disk space is about 50G, the actual qcow2 file had grown to over 180G, which is not good. Searching the web, the very promising virt-sparsify is mentioned again and again, and the man page gives hope, but as it turns out it is broken and calls qemu-img with incorrect/not-working arguments (see this bug). Another problem seems to be that by default the discard mode is not set to unmap. So here are the steps I took to reduce the size of the qcow2 image back down to around 60G.
  1. Turn off the VM
  2. Make sure you are using VirtIO for the disk, select Discard mode: unmap
  3. Boot into Windows, and from an elevated prompt run: Optimize-Volume -DriveLetter C -ReTrim -Verbose
  4. Shut down the VM
  5. Run a dummy conversion which sparsifies the image: qemu-img convert -O qcow2 orig.qcow2 sparse.qcow2
  6. Rename/Backup the images and boot back into Windows.
That helped in my case, without any consequences (till now) for the Windows installation.
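
To check how much space was actually reclaimed, the output of `qemu-img info --output=json` can be compared before and after the conversion. A small Python sketch, assuming that JSON output has been captured into a string; the helper name `image_usage` is made up for this example:

```python
import json


def image_usage(info_json: str) -> dict:
    """Summarise virtual vs. on-disk size from `qemu-img info --output=json`.

    Relies on the "virtual-size" and "actual-size" fields (both in bytes)
    of the JSON document.
    """
    info = json.loads(info_json)
    virtual = info["virtual-size"]
    actual = info["actual-size"]
    gib = 2 ** 30
    return {
        "virtual_gib": virtual / gib,
        "actual_gib": actual / gib,
        "ratio": actual / virtual,  # fraction of the virtual disk stored on the host
    }
```

One way to use it would be to run `qemu-img info --output=json sparse.qcow2`, pass the text to `image_usage`, and compare the `actual_gib` value with the one obtained for the original image.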

27 June 2021

François Marier: Removing unsafe-inline from Ikiwiki's style-src directive

After moving my Ikiwiki blog to my own server and enabling a basic CSP policy, I decided to see if I could tighten up the policy some more and stop relying on style-src 'unsafe-inline'. This does require that OpenID logins be disabled, but as a bonus, it also removes the need for jQuery to be present on the server.

Revised CSP policy First of all, I visited all of my pages in a Chromium browser and took note of the missing hashes listed in the developer tools console (Firefox doesn't show the missing hashes):
  • 'sha256-4Su6mBWzEIFnH4pAGMOuaeBrstwJN4Z3pq/s1Kn4/KQ='
  • 'sha256-j0bVhc2Wj58RJgvcJPevapx5zlVLw6ns6eYzK/hcA04='
  • 'sha256-j6Tt8qv7z2kSc7fUs0YHbrxawwsQcS05fVaX1r2qrbk='
  • 'sha256-p4cncjf0hAIeTSS5tXecf7qTUanDC27KdlKhT9eOsZU='
  • 'sha256-Y6v8OCtFfMmI5mbpwqCreLofmGZQfXYK7jJHCoHvn7A='
  • 'sha256-47DEQpj8HBSa+/TImW+5JCeuQeRkm5NMpJWZG3hSuFU='
which took care of all of the inline styles. Note that I kept unsafe-inline in the directive since it will be automatically ignored by browsers who understand hashes, but will be honored and make the site work on older browsers. Next I added the new unsafe-hashes source expression along with the hash of the CSS fragment (clear: both) that is present on all pages related to comments in Ikiwiki:
$ echo -n "clear: both" | openssl dgst -sha256 -binary | openssl base64 -A
matwEc6givhWX0+jiSfM1+E5UMk8/UGLdl902bjFBmY=
My final style-src directive is therefore the following:
style-src 'self' 'unsafe-inline' 'unsafe-hashes' 'sha256-4Su6mBWzEIFnH4pAGMOuaeBrstwJN4Z3pq/s1Kn4/KQ=' 'sha256-j0bVhc2Wj58RJgvcJPevapx5zlVLw6ns6eYzK/hcA04=' 'sha256-j6Tt8qv7z2kSc7fUs0YHbrxawwsQcS05fVaX1r2qrbk=' 'sha256-p4cncjf0hAIeTSS5tXecf7qTUanDC27KdlKhT9eOsZU=' 'sha256-Y6v8OCtFfMmI5mbpwqCreLofmGZQfXYK7jJHCoHvn7A=' 'sha256-47DEQpj8HBSa+/TImW+5JCeuQeRkm5NMpJWZG3hSuFU=' 'sha256-matwEc6givhWX0+jiSfM1+E5UMk8/UGLdl902bjFBmY='
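
The hash source expression computed with openssl above can also be produced in Python, which may be convenient when several inline fragments need hashing. This is only a sketch; the function name `csp_hash` is an assumption for illustration:

```python
import base64
import hashlib


def csp_hash(fragment: str) -> str:
    """Return a CSP hash source expression for an inline style fragment.

    The fragment is hashed byte-for-byte, so whitespace must match the
    page source exactly (like `echo -n`, no trailing newline is added).
    """
    digest = hashlib.sha256(fragment.encode("utf-8")).digest()
    return "'sha256-" + base64.b64encode(digest).decode("ascii") + "'"


if __name__ == "__main__":
    print(csp_hash("clear: both"))
```

The printed value can be pasted straight into the style-src directive, next to the hashes reported by the browser console.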

Browser compatibility While unsafe-hashes is not yet implemented in Firefox, it happens to work just fine due to a bug (i.e. unsafe-hashes is always enabled whether or not the policy contains it). It's possible that my new CSP policy won't work in Safari, but these CSS clears don't appear to be needed anyways and so it's just going to mean extra CSP reporting noise.

Removing jQuery Since jQuery appears to only be used to provide the authentication system selector UI, I decided to get rid of it. I couldn't find a way to get Ikiwiki to stop pulling it in and so I put the following hack in my Apache config file:
# Disable jQuery.
Redirect 204 /ikiwiki/jquery.fileupload.js
Redirect 204 /ikiwiki/jquery.fileupload-ui.js
Redirect 204 /ikiwiki/jquery.iframe-transport.js
Redirect 204 /ikiwiki/jquery.min.js
Redirect 204 /ikiwiki/jquery.tmpl.min.js
Redirect 204 /ikiwiki/jquery-ui.min.css
Redirect 204 /ikiwiki/jquery-ui.min.js
Redirect 204 /ikiwiki/login-selector/login-selector.js
Replacing the files on disk with an empty response seems to work very well and removes a whole lot of code that would otherwise be allowed by the script-src directive of my CSP policy. While there is a slight cosmetic change to the login page, I think the reduction in the attack surface is well worth it.

16 June 2021

Julien Danjou: Python Tools to Try in 2021

The Python programming language is one of the most popular languages and is in huge demand. It is free, has a large community, is intended for the development of projects of varying complexity, is easy to learn, and opens up great opportunities for programmers. To work comfortably with it, you need special Python tools, which are able to simplify your work. We have selected the best Python tools that will be relevant in 2021.

Mailtrap
As you probably know, in order to send an email, you need SMTP (Simple Mail Transfer Protocol). This is because you can't just send a letter directly to the recipient. It needs to be sent to the server from which the recipient will download this letter using IMAP or POP3. Mailtrap provides an opportunity to send emails in Python. Moreover, Mailtrap provides a REST API to access current emails. It can be used to automate email testing, which will improve your email marketing campaigns. For example, you can check the password recovery form in a Selenium test and immediately see if an email was sent to the correct address. Then take a new password from the email and try to enter the site with it. Cool, isn't it?

Pros
  • All emails are in one place.
  • Mailtrap provides multiple inboxes.
  • Shared access is present.
  • It is easy to set up.
  • RESTful API

Cons
No visible disadvantages were found.
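
As a rough illustration of sending a test email over SMTP from Python, here is a sketch using only the standard library. The host, port, and credentials are placeholders that would come from the inbox settings of whichever testing service is used; the function names are invented for this example and are not part of any Mailtrap API.

```python
import smtplib
from email.message import EmailMessage


def build_test_email(sender: str, recipient: str, subject: str, body: str) -> EmailMessage:
    """Create a plain-text message ready to be handed to an SMTP server."""
    message = EmailMessage()
    message["From"] = sender
    message["To"] = recipient
    message["Subject"] = subject
    message.set_content(body)
    return message


def send_via_smtp(message: EmailMessage, host: str, port: int,
                  username: str, password: str) -> None:
    """Send the message through an SMTP server (placeholder connection details)."""
    with smtplib.SMTP(host, port) as server:
        server.starttls()                  # upgrade the connection to TLS
        server.login(username, password)
        server.send_message(message)
```

A password-recovery test could build the message with `build_test_email`, send it with `send_via_smtp`, and then check the testing inbox for the expected recipient.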

Django
Django is a free and open-source full-stack framework. It is one of the most important and popular among Python developers. It helps you move from a prototype to a ready-made working solution in a short time since its main task is to automate processes and speed up work through associations and libraries. It's a great choice for a product launch.
You can use Django if at least a few of the following points interest you:
  • There is a need to develop the server-side of the API.
  • You need to develop a web application.
  • In the course of work, many changes are made, you have to constantly deploy the application and make edits.
  • There are many complex tasks that are difficult to solve on your own, and you will need the help of the community.
  • ORM support is needed to avoid accessing the database directly.
  • There is a need to integrate new technologies such as machine learning.
Django is a great Python Web Framework that does its job. It is not for nothing that it is one of the most popular, and is actively used by millions of developers.

Pros
Django has quite a few advantages. It contains a large number of ready-made solutions, which greatly simplifies development. Admin panel, database migration, various forms, user authentication tools are extremely helpful. The structure is very clear and simple. A large community helps to solve almost any problem. Thanks to ORM, there is a high level of security and it is comfortable to work with databases.

Cons
Despite its powerful capabilities, the Django web framework has drawbacks. It is massive and monolithic, and therefore it develops slowly. Despite the many generic modules, the development speed of Django itself is reduced.

CherryPy
CherryPy is a micro-framework. It is designed to solve specific problems, capable of running the program on any operating system. CherryPy is used in the following cases:
  • To create an application with small code size.
  • There is a need to manage several servers at the same time.
  • You need to monitor the performance of applications.
CherryPy refers to Python Frameworks, which are designed for specific tasks. It's clear, user-friendly, and ideal for Android development.

Pros
The CherryPy Python tool has a friendly and understandable development environment. This is a functional and complete framework, which can be used to build good applications. The source code is open, so the platform is completely free for developers, and the community, although not too large, is very responsive, and always helps to solve problems.

Cons
There are not so many cons to this Python tool. It is not capable of performing complex tasks and functions; it is intended more for specific solutions, for example, for the development of certain plugins or modules.

Pyramid
The Pyramid Python tool is designed for programming complex objects and solving multifunctional problems. It is used by professional programmers and is traditionally used for identification and routing. It is aimed at a wide audience and is capable of developing API prototypes.
It is used in the following cases:
  • You need problem indicator tools to make timely adjustments and edits.
  • You use several programming languages at once.
  • You work with reporting and financial calculations, forecasting.
  • You need to quickly create a simple application.
At the same time, the Python Web Framework Pyramid allows you to create complex applications with great functionality like a translation software.

Pros
Pyramid does an excellent job of developing basic applications quickly. It is quite flexible and easy to learn. In fact, the key to the success of this framework is that it is completely based on fundamental principles, using simple and basic programming techniques. It is minimalistic, but at the same time offers users a lot of freedom of action. It is able to work with both small applications and powerful multifunctional programs.

Cons
It is difficult to deviate from the basic principles. This Python tool makes the decision for you. Simple programs are very easy to implement. But to do something complex and large-scale, you have to completely immerse yourself in the study of the environment and obey it.

Grok
Grok is a Python tool that works with templates. Its main task is to eliminate repetitions in the code. If the element is repeated, then the template that was already created earlier is simply applied. This greatly simplifies and speeds up the work.
Grok suits developers in the following cases:
  • If a programmer has little experience and is not yet ready to develop his modules.
  • There is a need to quickly develop a simple application.
  • The functionality of the application is simple, straightforward, and the interface does not play a key role.

Pros
The Grok framework is a child of Zope3, which was released earlier. It has a simplified structure of work, easy installation of modules, more capabilities, and better flexibility. It is designed to develop small applications. Yes, it is not intended for complex work, but due to its functionality, it allows you to quickly implement a project.

Cons
The Grok community is not very large, as this Python tool has not gained widespread popularity. Nevertheless, it is used by Python adepts for comfortable development. It is impossible to implement complex tasks on it since the possibilities are quite limited.
Grok is one of the best Python Web Frameworks. It is understandable and has enough features for comfortable development.

Web2Py
Web2Py is a Python tool that has its own IDE, which includes a code editor, debugger, and deployment. It works great without the need for configuration or installation, provides a high level of data security, and is suitable for work on various platforms.
Web2Py is great in the following cases:
  • When there is a need to develop something on different operating systems.
  • If there is no way to install and configure the framework.
  • When a high level of data security is required, for example, when developing financial applications or sales performance management tools.
  • If you need to carefully track bugs right during development, and not during the testing phase.

Pros
Web2Py is capable of working with different protocols, has a built-in error tracker, and has a backward compatibility feature that helps to work on the basis of previous versions of the framework. This means that code maintenance becomes much easier and cheaper. It's free, open-source, and very flexible.

Cons
Among the many Python tools, there are not many that require the latest version of the language. Web2Py is one of those and won't work on Python 3 and below. Therefore, you need to constantly monitor the updates.
Web2Py does an excellent job of its tasks. It is quite simple and accessible to everyone.

BlueBream
BlueBream used to be called Zope3 before. It copes well with tasks of the medium and high level of complexity and is suitable for working on serious projects.

Pros
The BlueBream build system is quite powerful and suitable for complex tasks. You can create functional applications on it, and the principle of reuse of components makes the code easier. At the same time, the speed of development increases. The software can be scaled, and a transactional object database provides an easy path to store it. This means that queries are processed quickly and database management is simple.

Cons
This is not a very flexible framework; it is better to know clearly in advance what is required of it. In addition, it cannot withstand heavy loads. When working with 1000 users at the same time, it can crash and give errors. Therefore, it should be used to solve narrow problems.
Python frameworks are often designed for specific tasks. BlueBream is one of these and is suitable for applications where database management plays a key role.

Conclusion
Python tools come in different forms and have vastly different capabilities. There are quite a few of them, but in 2021 these will be the most popular and in demand. Experienced programmers always choose several development tools for their comfortable work.

14 June 2021

François Marier: Self-hosting an Ikiwiki blog

8.5 years ago, I moved my blog to Ikiwiki and Branchable. It's now time for me to take the next step and host my blog on my own server. This is how I migrated from Branchable to my own Apache server.

Installing Ikiwiki dependencies Here are all of the extra Debian packages I had to install on my server:
apt install ikiwiki ikiwiki-hosting-common gcc libauthen-passphrase-perl libcgi-formbuilder-perl libcrypt-ssleay-perl libjson-xs-perl librpc-xml-perl python-docutils libxml-feed-perl libsearch-xapian-perl libmailtools-perl highlight-common xapian-omega
apt install --no-install-recommends ikiwiki-hosting-web libgravatar-url-perl libmail-sendmail-perl libcgi-session-perl
apt purge libnet-openid-consumer-perl
Then I enabled the CGI module in Apache:
a2enmod cgi
and un-commented the following in /etc/apache2/mods-available/mime.conf:
AddHandler cgi-script .cgi

Creating a separate user account Since Ikiwiki needs to regenerate my blog whenever a new article is pushed to the git repo or a comment is accepted, I created a restricted user account for it:
adduser blog
adduser blog sshuser
chsh -s /usr/bin/git-shell blog

git setup Thanks to Branchable storing blogs in git repositories, I was able to import my blog using a simple git clone in /home/blog (the srcdir):
git clone --bare git://feedingthecloud.branchable.com/ source.git
Note that the name of the directory (source.git) is important for the ikiwikihosting plugin to work. Then I pulled the .setup file out of the setup branch in that repo and put it in /home/blog/.ikiwiki/FeedingTheCloud.setup. After that, I deleted the setup branch and the origin remote from that clone:
git branch -d setup
git remote rm origin
Following the recommended git configuration, I created a working directory (the repository) for the blog user to modify the blog as needed:
cd /home/blog/
git clone /home/blog/source.git FeedingTheCloud
I added my own ssh public key to /home/blog/.ssh/authorized_keys so that I could push to the srcdir from my laptop. Finally, I generated a new ssh key without a passphrase:
ssh-keygen -t ed25519
and added it as deploy key to the GitHub repo which acts as a read-only mirror of my blog.

Ikiwiki config While I started with the Branchable setup file, I changed the following things in it:
adminemail: webmaster@fmarier.org
srcdir: /home/blog/FeedingTheCloud
destdir: /var/www/blog
url: https://feeding.cloud.geek.nz
cgiurl: https://feeding.cloud.geek.nz/blog.cgi
cgi_wrapper: /var/www/blog/blog.cgi
cgi_wrappermode: 675
add_plugins:
- goodstuff
- lockedit
- comments
- blogspam
- sidebar
- attachment
- favicon
- format
- highlight
- search
- theme
- moderatedcomments
- flattr
- calendar
- headinganchors
- notifyemail
- anonok
- autoindex
- date
- relativedate
- htmlbalance
- pagestats
- sortnaturally
- ikiwikihosting
- gitpush
- emailauth
disable_plugins:
- brokenlinks
- fortune
- more
- openid
- orphans
- passwordauth
- progress
- recentchanges
- repolist
- toggle
- txt
sslcookie: 1
cookiejar:
  file: /home/blog/.ikiwiki/cookies
useragent: ikiwiki
git_wrapper: /home/blog/source.git/hooks/post-update
urlalias:
- http://feeds.cloud.geek.nz/
- http://www.feeding.cloud.geek.nz/
owner: francois@fmarier.org
hostname: feeding.cloud.geek.nz
emailauth_sender: login@fmarier.org
allowed_attachments: admin()
Then I created the destdir:
mkdir /var/www/blog
chown blog:blog /var/www/blog
and generated the initial copy of the blog as the blog user:
ikiwiki --setup .ikiwiki/FeedingTheCloud.setup --wrappers --rebuild
One thing that failed to generate properly was the tag cloud (from the pagestats plugin). I have not been able to figure out why it fails to generate any output when run this way, but if I push to the repo and let the git hook handle the rebuilding of the wiki, the tag cloud is generated correctly. Consequently, fixing this is not high on my list of priorities, but if you happen to know what the problem is, please reach out.

Apache config Here's the Apache config I put in /etc/apache2/sites-available/blog.conf:
<VirtualHost *:443>
    ServerName feeding.cloud.geek.nz
    SSLEngine On
    SSLCertificateFile /etc/letsencrypt/live/feeding.cloud.geek.nz/fullchain.pem
    SSLCertificateKeyFile /etc/letsencrypt/live/feeding.cloud.geek.nz/privkey.pem
    Header set Strict-Transport-Security: "max-age=63072000; includeSubDomains; preload"
    Include /etc/fmarier-org/blog-common
</VirtualHost>
<VirtualHost *:443>
    ServerName www.feeding.cloud.geek.nz
    ServerAlias feeds.cloud.geek.nz
    SSLEngine On
    SSLCertificateFile /etc/letsencrypt/live/feeding.cloud.geek.nz/fullchain.pem
    SSLCertificateKeyFile /etc/letsencrypt/live/feeding.cloud.geek.nz/privkey.pem
    Redirect permanent / https://feeding.cloud.geek.nz/
</VirtualHost>
<VirtualHost *:80>
    ServerName feeding.cloud.geek.nz
    ServerAlias www.feeding.cloud.geek.nz
    ServerAlias feeds.cloud.geek.nz
    Redirect permanent / https://feeding.cloud.geek.nz/
</VirtualHost>
and the common config I put in /etc/fmarier-org/blog-common:
ServerAdmin webmaster@fmarier.org
DocumentRoot /var/www/blog
LogLevel core:info
CustomLog ${APACHE_LOG_DIR}/blog-access.log combined
ErrorLog ${APACHE_LOG_DIR}/blog-error.log
AddType application/rss+xml .rss
<Location /blog.cgi>
        Options +ExecCGI
</Location>
before enabling all of this using:
a2ensite blog
apache2ctl configtest
systemctl restart apache2.service
The feeds.cloud.geek.nz domain used to be pointing to Feedburner and so I need to maintain it in order to avoid breaking RSS feeds from folks who added my blog to their reader a long time ago.

Server-side improvements Since I'm now in control of the server configuration, I was able to make several improvements to how my blog is served. First of all, I enabled the HTTP/2 and Brotli modules:
a2enmod http2
a2enmod brotli
and enabled Brotli compression by putting the following in /etc/apache2/conf-available/compression.conf:
<IfModule mod_brotli.c>
  <IfDefine !TRANSFER_COMPRESSION>
    Define TRANSFER_COMPRESSION BROTLI_COMPRESS
  </IfDefine>
</IfModule>
<IfModule mod_deflate.c>
  <IfDefine !TRANSFER_COMPRESSION>
    Define TRANSFER_COMPRESSION DEFLATE
  </IfDefine>
</IfModule>
<IfDefine TRANSFER_COMPRESSION>
  <IfModule mod_filter.c>
    AddOutputFilterByType ${TRANSFER_COMPRESSION} text/html text/plain text/xml text/css text/javascript
    AddOutputFilterByType ${TRANSFER_COMPRESSION} application/x-javascript application/javascript application/ecmascript
    AddOutputFilterByType ${TRANSFER_COMPRESSION} application/rss+xml
    AddOutputFilterByType ${TRANSFER_COMPRESSION} application/xml
  </IfModule>
</IfDefine>
and replacing /etc/apache2/mods-available/deflate.conf with the following:
# Moved to /etc/apache2/conf-available/compression.conf as per https://bugs.debian.org/972632
before enabling this new config:
a2enconf compression
Next, I made my blog available as a Tor onion service by putting the following in /etc/apache2/sites-available/blog.conf:
<VirtualHost *:443>
    ServerName feeding.cloud.geek.nz
    ServerAlias xfdug5vmfi6oh42fp6ahhrqdjcf7ysqat6fkp5dhvde4d7vlkqixrsad.onion
    Header set Onion-Location "http://xfdug5vmfi6oh42fp6ahhrqdjcf7ysqat6fkp5dhvde4d7vlkqixrsad.onion%{REQUEST_URI}s"
    Header set alt-svc 'h2="xfdug5vmfi6oh42fp6ahhrqdjcf7ysqat6fkp5dhvde4d7vlkqixrsad.onion:443"; ma=315360000; persist=1'
    ... 
<VirtualHost *:80>
    ServerName xfdug5vmfi6oh42fp6ahhrqdjcf7ysqat6fkp5dhvde4d7vlkqixrsad.onion
    Include /etc/fmarier-org/blog-common
</VirtualHost>
Then I followed the Mozilla Observatory recommendations and enabled the following security headers:
Header set Content-Security-Policy: "default-src 'none'; report-uri https://fmarier.report-uri.com/r/d/csp/enforce ; style-src 'self' 'unsafe-inline' ; img-src 'self' https://seccdn.libravatar.org/ ; script-src https://feeding.cloud.geek.nz/ikiwiki/ https://xfdug5vmfi6oh42fp6ahhrqdjcf7ysqat6fkp5dhvde4d7vlkqixrsad.onion/ikiwiki/ http://xfdug5vmfi6oh42fp6ahhrqdjcf7ysqat6fkp5dhvde4d7vlkqixrsad.onion/ikiwiki/ 'unsafe-inline' 'sha256-pA8FbKo4pYLWPDH2YMPqcPMBzbjH/RYj0HlNAHYoYT0=' 'sha256-Kn5E/7OLXYSq+EKMhEBGJMyU6bREA9E8Av9FjqbpGKk=' 'sha256-/BTNlczeBxXOoPvhwvE1ftmxwg9z+WIBJtpk3qe7Pqo=' ; base-uri 'self'; form-action 'self' ; frame-ancestors 'self'"
Header set X-Frame-Options: "SAMEORIGIN"
Header set Referrer-Policy: "same-origin"
Header set X-Content-Type-Options: "nosniff"
Note that the Mozilla Observatory is mistakenly identifying HTTP onion services as insecure, so you can ignore that failure. I also used the Mozilla TLS config generator to improve the TLS config for my server. Then I added security.txt and gpc.json to the root of my git repo and then added the following aliases to put these files in the right place:
Alias /.well-known/gpc.json /var/www/blog/gpc.json
Alias /.well-known/security.txt /var/www/blog/security.txt
I also followed these instructions to create a sitemap for my blog with the following alias:
Alias /sitemap.xml /var/www/blog/sitemap/index.rss
Finally, I simplified a few error pages to save bandwidth:
ErrorDocument 301 " "
ErrorDocument 302 " "
ErrorDocument 404 "Not Found"

Monitoring 404s Another advantage of running my own web server is that I can monitor the 404s easily using logcheck by putting the following in /etc/logcheck/logcheck.logfiles:
/var/log/apache2/blog-error.log 
Based on that, I added a few redirects to point bots and users to the location of my RSS feed:
Redirect permanent /atom /index.atom
Redirect permanent /comments.rss /comments/index.rss
Redirect permanent /comments.atom /comments/index.atom
Redirect permanent /FeedingTheCloud /index.rss
Redirect permanent /feed /index.rss
Redirect permanent /feed/ /index.rss
Redirect permanent /feeds/posts/default /index.rss
Redirect permanent /rss /index.rss
Redirect permanent /rss/ /index.rss
and to tell them to stop trying to fetch obsolete resources:
Redirect gone /~ff/FeedingTheCloud
Redirect gone /gittip_button.png
Redirect gone /ikiwiki.cgi
I also used these 404s to discover a few old Feedburner URLs that I could redirect to the right place using archive.org:
Redirect permanent /feeds/1572545745827565861/comments/default /posts/watch-all-of-your-logs-using-monkeytail/comments.atom
Redirect permanent /feeds/1582328597404141220/comments/default /posts/news-feeds-rssatom-for-mythtvorg-and/comments.atom
...
Redirect permanent /feeds/8490436852808833136/comments/default /posts/recovering-lost-git-commits/comments.atom
Redirect permanent /feeds/963415010433858516/comments/default /posts/debugging-openwrt-routers-by-shipping/comments.atom
I also put the following robots.txt in the git repo in order to stop a bunch of authentication errors coming from crawlers:
User-agent: *
Disallow: /blog.cgi
Disallow: /ikiwiki.cgi

Future improvements There are a few things I'd like to improve on my current setup. The first one is to remove the ikiwikihosting and gitpush plugins and replace them with a small script which would simply git push to the read-only GitHub mirror. Then I could uninstall the ikiwiki-hosting-common and ikiwiki-hosting-web packages since that's all I use them for. Next, I would like to have proper support for signed git pushes. At the moment, I have the following in /home/blog/source.git/config:
[receive]
    advertisePushOptions = true
    certNonceSeed = "(random string)"
but I'd like to also reject unsigned pushes. While my blog now has a CSP policy which doesn't rely on unsafe-inline for scripts, it does rely on unsafe-inline for stylesheets. I tried to remove this but the actual calls to allow seemed to be located deep within jQuery and so I gave up. Update: now fixed. Finally, I'd like to figure out a good way to deal with articles which don't currently have comments. At the moment, if you try to subscribe to their comment feed, it returns a 404. For example:
[Sun Jun 06 17:43:12.336350 2021] [core:info] [pid 30591:tid 140253834704640] [client 66.249.66.70:57381] AH00128: File does not exist: /var/www/blog/posts/using-iptables-with-network-manager/comments.atom
This is obviously not ideal since many feed readers will refuse to add a feed which is currently not found even though it could become real in the future. If you know of a way to fix this, please let me know.
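For the wish above to reject unsigned pushes, one possible direction is a pre-receive hook. The sketch below is not part of the current setup. It assumes the hook sees the GIT_PUSH_CERT_STATUS and GIT_PUSH_CERT_NONCE_STATUS environment variables that git exports to hooks for signed pushes, and the strict policy (only a good signature with a valid nonce) is my own choice.

```python
#!/usr/bin/env python3
import os
import sys


def push_is_acceptable(env) -> bool:
    """Accept only pushes with a good signature and a valid nonce.

    For an unsigned push the variables are simply absent, so .get()
    returns None and the push is refused.
    """
    good_signature = env.get("GIT_PUSH_CERT_STATUS") == "G"
    valid_nonce = env.get("GIT_PUSH_CERT_NONCE_STATUS") == "OK"
    return good_signature and valid_nonce


def main() -> int:
    if push_is_acceptable(os.environ):
        return 0
    print("rejected: this repository only accepts signed pushes", file=sys.stderr)
    return 1

# In an actual hooks/pre-receive file, finish with: raise SystemExit(main())
```

A status of U (good signature, unknown validity) is refused here as well; whether that is too strict depends on how the keyring of the receiving user is set up.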

13 June 2021

Vincent Fourmond: Solution for QSoas quiz #2: averaging several Y values for the same X value

This post describes two similar solutions to the Quiz #2, using the data files found there. The two solutions described here rely on split-on-values. The first solution is the one that came naturally to me, and is by far the most general and extensible, but the second one is shorter, and doesn't require external script files.
Solution #1
The key to both solutions is to separate the original data into a series of datasets that only contain data at a fixed value of x (which corresponds here to a fixed pH), and then process each dataset one by one to extract the average and standard deviation. This first step is done thus:
QSoas> load kcat-vs-ph.dat
QSoas> split-on-values pH x /flags=data
After these commands, the stack contains a series of datasets bearing the data flag, each of which contains a single column of data, as can be seen from the beginning of the output of a show-stack command:
QSoas> k
Normal stack:
	 F  C	Rows	Segs	Name	
#0	(*) 1	43	1	'kcat-vs-ph_subset_22.dat'
#1	(*) 1	44	1	'kcat-vs-ph_subset_21.dat'
#2	(*) 1	43	1	'kcat-vs-ph_subset_20.dat'
...
Each of these datasets has a meta-data named pH whose value is the original x value from kcat-vs-ph.dat. Now, the idea is to run a stats command on the resulting datasets, extracting the average value of x and its standard deviation, together with the value of the meta pH. The most natural and general way to do this is to use run-for-datasets, using the following script file (named process-one.cmds):
stats /meta=pH /output=true /stats=x_average,x_stddev
So the command looks like:
QSoas> run-for-datasets process-one.cmds flagged:data
This command produces an output file containing, for each flagged dataset, a line containing x_average, x_stddev, and pH. Then, it is just a matter of loading the output file and shuffling the columns in the right order to get the data in the form asked. Overall, this looks like this:
l kcat-vs-ph.dat
split-on-values pH x /flags=data
output result.dat /overwrite=true
run-for-datasets process-one.cmds flagged:data
l result.dat
apply-formula tmp=y2;y2=y;y=x;x=tmp
dataset-options /yerrors=y2
The slight improvement over what is described above is the use of the output command to write the output to a dedicated file (here result.dat), instead of out.dat and ensuring it is overwritten, so that no data remains from previous runs.
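For readers without QSoas at hand, the computation performed by the commands above can be sketched in a few lines of Python. This is only an illustration of the idea (group the values by pH, then take the average and standard deviation of each group), not of how QSoas works internally. The function name is made up, and the use of the population standard deviation is an assumption, since the post does not say which convention x_stddev follows.

```python
from collections import defaultdict
from statistics import mean, pstdev


def average_by_x(points):
    """Group (x, y) pairs by x and return (x, average, stddev) per group.

    Grouping plays the role of split-on-values; mean/pstdev play the role
    of the stats command run on each subset.
    """
    groups = defaultdict(list)
    for x, y in points:
        groups[x].append(y)
    # Assumption: population standard deviation (pstdev).
    return [(x, mean(ys), pstdev(ys)) for x, ys in sorted(groups.items())]


if __name__ == "__main__":
    # Tiny made-up dataset: two repeats at pH 7, one at pH 8.
    for row in average_by_x([(7.0, 1.0), (7.0, 3.0), (8.0, 5.0)]):
        print(row)
```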

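Solution #2
The second solution is almost the same as the first one, with two improvements, both visible in the commands below: stats is given the flagged datasets directly through /buffers=flagged:data, so that neither run-for-datasets nor an external script file is needed; and /accumulate=* collects the statistics in the accumulator instead of an output file, so that pop is enough to get them back as a dataset. This yields the following, smaller, solution: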
l kcat-vs-ph.dat
split-on-values pH x /flags=data
stats /meta=pH /accumulate=* /stats=x_average,x_stddev /buffers=flagged:data
pop
apply-formula tmp=y2;y2=y;y=x;x=tmp
dataset-options /yerrors=y2


About QSoas
QSoas is a powerful open source data analysis program that focuses on flexibility and powerful fitting capacities. It is released under the GNU General Public License. It is described in Fourmond, Anal. Chem., 2016, 88 (10), pp 5050-5052. Current version is 3.0. You can download its source code there (or clone from the GitHub repository) and compile it yourself, or buy precompiled versions for MacOS and Windows there.

30 May 2021

Vincent Fourmond: QSoas quiz #2: averaging several Y values for the same X value

This second quiz may sound like the first one, but in fact, the approach used is completely different. The point is to gather some elementary statistics from a series of experiments performed under different conditions, but with several repeats at the same conditions.
Quiz
You are given a file (which you can download there) that contains a series of pH value data: the X column is the pH, the Y column the result of the experiment at the given pH (let's say the measure of the catalytic rate of an enzyme). Your task is to take this data and produce a single dataset which contains, for each pH value, the pH, the average of the results at that pH and the standard deviation. The result should be identical to the following file, and should look like that:
There are several ways to do this, but all ways must rely on stats, and the more natural way in QSoas is to take advantage of split-on-values, which is a very powerful command but somewhat hard to master, which is the point of this Quiz.
By the way, the data file is purely synthetic; if you look in the GitHub repository, you'll see how it was generated.


28 May 2021

Jonathan McDowell: Trying to understand Kubernetes networking

I previously built a single node Kubernetes cluster as a test environment to learn more about it. The first thing I want to try to understand is its networking. In particular, the IP addresses that are listed are all 10.* and my host's network is a 192.168/24. I understand each pod gets its own virtual ethernet interface and associated IP address, and these are generally private within the cluster (and firewalled out other than for exposed services). What does that actually look like?
$ ip route
default via 192.168.53.1 dev enx00e04c6851de
172.17.0.0/16 dev docker0 proto kernel scope link src 172.17.0.1 linkdown
192.168.0.0/24 dev weave proto kernel scope link src 192.168.0.1
192.168.53.0/24 dev enx00e04c6851de proto kernel scope link src 192.168.53.147
Huh. No sign of any way to get to 10.107.66.138 (the IP my echoserver from the previous post is available on directly from the host). What about network interfaces? (under the cut because it's lengthy)
$ ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
2: enx00e04c6851de: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether 00:e0:4c:68:51:de brd ff:ff:ff:ff:ff:ff
    inet 192.168.53.147/24 brd 192.168.53.255 scope global dynamic enx00e04c6851de
       valid_lft 41571sec preferred_lft 41571sec
3: wlp1s0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether 74:d8:3e:70:3b:18 brd ff:ff:ff:ff:ff:ff
4: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default
    link/ether 02:42:18:04:9e:08 brd ff:ff:ff:ff:ff:ff
    inet 172.17.0.1/16 brd 172.17.255.255 scope global docker0
       valid_lft forever preferred_lft forever
5: datapath: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1376 qdisc noqueue state UNKNOWN group default qlen 1000
    link/ether d2:5a:fd:c1:56:23 brd ff:ff:ff:ff:ff:ff
7: weave: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1376 qdisc noqueue state UP group default qlen 1000
    link/ether 12:82:8f:ed:c7:bf brd ff:ff:ff:ff:ff:ff
    inet 192.168.0.1/24 brd 192.168.0.255 scope global weave
       valid_lft forever preferred_lft forever
9: vethwe-datapath@vethwe-bridge: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1376 qdisc noqueue master datapath state UP group default
    link/ether b6:49:88:d6:6d:84 brd ff:ff:ff:ff:ff:ff
10: vethwe-bridge@vethwe-datapath: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1376 qdisc noqueue master weave state UP group default
    link/ether 6e:6c:03:1d:e5:0e brd ff:ff:ff:ff:ff:ff
11: vxlan-6784: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 65535 qdisc noqueue master datapath state UNKNOWN group default qlen 1000
    link/ether 9a:af:c5:0a:b3:fd brd ff:ff:ff:ff:ff:ff
13: vethwepl534c0a6@if12: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1376 qdisc noqueue master weave state UP group default
    link/ether 1e:ac:f1:85:61:9a brd ff:ff:ff:ff:ff:ff link-netnsid 0
15: vethwepl9ffd6b6@if14: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1376 qdisc noqueue master weave state UP group default
    link/ether 56:ca:71:2a:ab:39 brd ff:ff:ff:ff:ff:ff link-netnsid 1
17: vethwepl62b369d@if16: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1376 qdisc noqueue master weave state UP group default
    link/ether e2:a0:bb:ee:fc:73 brd ff:ff:ff:ff:ff:ff link-netnsid 2
23: vethwepl6669168@if22: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1376 qdisc noqueue master weave state UP group default
    link/ether f2:e7:e6:95:e0:61 brd ff:ff:ff:ff:ff:ff link-netnsid 3
That looks like a collection of virtual ethernet devices that are being managed by the weave networking plugin, and presumably partnered inside each pod. They're bridged to the weave interface (the master weave bit). Still no clues about the 10.* range. What about ARP?
$ ip neigh
192.168.53.1 dev enx00e04c6851de lladdr e4:8d:8c:35:98:d5 DELAY
192.168.0.4 dev datapath lladdr da:22:06:96:50:cb STALE
192.168.0.2 dev weave lladdr 66:eb:ce:16:3c:62 REACHABLE
192.168.53.136 dev enx00e04c6851de lladdr 00:e0:4c:39:f2:54 REACHABLE
192.168.0.6 dev weave lladdr 56:a9:f0:d2:9e:f3 STALE
192.168.0.3 dev datapath lladdr f2:42:c9:c3:08:71 STALE
192.168.0.3 dev weave lladdr f2:42:c9:c3:08:71 REACHABLE
192.168.0.2 dev datapath lladdr 66:eb:ce:16:3c:62 STALE
192.168.0.6 dev datapath lladdr 56:a9:f0:d2:9e:f3 STALE
192.168.0.4 dev weave lladdr da:22:06:96:50:cb STALE
192.168.0.5 dev datapath lladdr fe:6f:1b:14:56:5a STALE
192.168.0.5 dev weave lladdr fe:6f:1b:14:56:5a REACHABLE
Nope. That just looks like addresses on the weave managed bridge. Alright. What about firewalling?
$ nft list ruleset
table ip nat  
	chain DOCKER  
		iifname "docker0" counter packets 0 bytes 0 return
	 
	chain POSTROUTING  
		type nat hook postrouting priority srcnat; policy accept;
		 counter packets 531750 bytes 31913539 jump KUBE-POSTROUTING
		oifname != "docker0" ip saddr 172.17.0.0/16 counter packets 1 bytes 84 masquerade 
		counter packets 525600 bytes 31544134 jump WEAVE
	 
	chain PREROUTING  
		type nat hook prerouting priority dstnat; policy accept;
		 counter packets 180 bytes 12525 jump KUBE-SERVICES
		fib daddr type local counter packets 23 bytes 1380 jump DOCKER
	 
	chain OUTPUT  
		type nat hook output priority -100; policy accept;
		 counter packets 527005 bytes 31628455 jump KUBE-SERVICES
		ip daddr != 127.0.0.0/8 fib daddr type local counter packets 285425 bytes 17125524 jump DOCKER
	 
	chain KUBE-MARK-DROP  
		counter packets 0 bytes 0 meta mark set mark or 0x8000 
	 
	chain KUBE-MARK-MASQ  
		counter packets 0 bytes 0 meta mark set mark or 0x4000 
	 
	chain KUBE-POSTROUTING  
		mark and 0x4000 != 0x4000 counter packets 4622 bytes 277720 return
		counter packets 0 bytes 0 meta mark set mark xor 0x4000 
		 counter packets 0 bytes 0 masquerade 
	 
	chain KUBE-KUBELET-CANARY  
	 
	chain INPUT  
		type nat hook input priority 100; policy accept;
	 
	chain KUBE-PROXY-CANARY  
	 
	chain KUBE-SERVICES  
		meta l4proto tcp ip daddr 10.96.0.10  tcp dport 9153 counter packets 0 bytes 0 jump KUBE-SVC-JD5MR3NA4I4DYORP
		meta l4proto tcp ip daddr 10.107.66.138  tcp dport 8080 counter packets 1 bytes 60 jump KUBE-SVC-666FUMINWJLRRQPD
		meta l4proto tcp ip daddr 10.111.16.129  tcp dport 443 counter packets 0 bytes 0 jump KUBE-SVC-EZYNCFY2F7N6OQA2
		meta l4proto tcp ip daddr 10.96.9.41  tcp dport 443 counter packets 0 bytes 0 jump KUBE-SVC-EDNDUDH2C75GIR6O
		meta l4proto tcp ip daddr 192.168.53.147  tcp dport 443 counter packets 0 bytes 0 jump KUBE-XLB-EDNDUDH2C75GIR6O
		meta l4proto tcp ip daddr 10.96.9.41  tcp dport 80 counter packets 0 bytes 0 jump KUBE-SVC-CG5I4G2RS3ZVWGLK
		meta l4proto tcp ip daddr 192.168.53.147  tcp dport 80 counter packets 0 bytes 0 jump KUBE-XLB-CG5I4G2RS3ZVWGLK
		meta l4proto tcp ip daddr 10.96.0.1  tcp dport 443 counter packets 0 bytes 0 jump KUBE-SVC-NPX46M4PTMTKRN6Y
		meta l4proto udp ip daddr 10.96.0.10  udp dport 53 counter packets 0 bytes 0 jump KUBE-SVC-TCOU7JCQXEZGVUNU
		meta l4proto tcp ip daddr 10.96.0.10  tcp dport 53 counter packets 0 bytes 0 jump KUBE-SVC-ERIFXISQEP7F7OF4
		 fib daddr type local counter packets 3312 bytes 198720 jump KUBE-NODEPORTS
	 
	chain KUBE-NODEPORTS  
		meta l4proto tcp  tcp dport 31529 counter packets 0 bytes 0 jump KUBE-MARK-MASQ
		meta l4proto tcp  tcp dport 31529 counter packets 0 bytes 0 jump KUBE-SVC-666FUMINWJLRRQPD
		meta l4proto tcp ip saddr 127.0.0.0/8  tcp dport 30894 counter packets 0 bytes 0 jump KUBE-MARK-MASQ
		meta l4proto tcp  tcp dport 30894 counter packets 0 bytes 0 jump KUBE-XLB-EDNDUDH2C75GIR6O
		meta l4proto tcp ip saddr 127.0.0.0/8  tcp dport 32740 counter packets 0 bytes 0 jump KUBE-MARK-MASQ
		meta l4proto tcp  tcp dport 32740 counter packets 0 bytes 0 jump KUBE-XLB-CG5I4G2RS3ZVWGLK
	 
	chain KUBE-SVC-NPX46M4PTMTKRN6Y  
		 counter packets 0 bytes 0 jump KUBE-SEP-Y6PHKONXBG3JINP2
	 
	chain KUBE-SEP-Y6PHKONXBG3JINP2  
		ip saddr 192.168.53.147  counter packets 0 bytes 0 jump KUBE-MARK-MASQ
		meta l4proto tcp   counter packets 0 bytes 0 dnat to 192.168.53.147:6443
	 
	chain WEAVE  
		# match-set weaver-no-masq-local dst  counter packets 135966 bytes 8160820 return
		ip saddr 192.168.0.0/24 ip daddr 224.0.0.0/4 counter packets 0 bytes 0 return
		ip saddr != 192.168.0.0/24 ip daddr 192.168.0.0/24 counter packets 0 bytes 0 masquerade 
		ip saddr 192.168.0.0/24 ip daddr != 192.168.0.0/24 counter packets 33 bytes 2941 masquerade 
	 
	chain WEAVE-CANARY  
	 
	chain KUBE-SVC-JD5MR3NA4I4DYORP  
		  counter packets 0 bytes 0 jump KUBE-SEP-6JI23ZDEH4VLR5EN
		 counter packets 0 bytes 0 jump KUBE-SEP-FATPLMAF37ZNQP5P
	 
	chain KUBE-SEP-6JI23ZDEH4VLR5EN  
		ip saddr 192.168.0.2  counter packets 0 bytes 0 jump KUBE-MARK-MASQ
		meta l4proto tcp   counter packets 0 bytes 0 dnat to 192.168.0.2:9153
	 
	chain KUBE-SVC-TCOU7JCQXEZGVUNU  
		  counter packets 0 bytes 0 jump KUBE-SEP-JTN4UBVS7OG5RONX
		 counter packets 0 bytes 0 jump KUBE-SEP-4TCKAEJ6POVEFPVW
	 
	chain KUBE-SEP-JTN4UBVS7OG5RONX  
		ip saddr 192.168.0.2  counter packets 0 bytes 0 jump KUBE-MARK-MASQ
		meta l4proto udp   counter packets 0 bytes 0 dnat to 192.168.0.2:53
	 
	chain KUBE-SVC-ERIFXISQEP7F7OF4  
		  counter packets 0 bytes 0 jump KUBE-SEP-UPZX2EM3TRFH2ASL
		 counter packets 0 bytes 0 jump KUBE-SEP-KPHYKKPVMB473Z76
	 
	chain KUBE-SEP-UPZX2EM3TRFH2ASL  
		ip saddr 192.168.0.2  counter packets 0 bytes 0 jump KUBE-MARK-MASQ
		meta l4proto tcp   counter packets 0 bytes 0 dnat to 192.168.0.2:53
	 
	chain KUBE-SEP-4TCKAEJ6POVEFPVW  
		ip saddr 192.168.0.3  counter packets 0 bytes 0 jump KUBE-MARK-MASQ
		meta l4proto udp   counter packets 0 bytes 0 dnat to 192.168.0.3:53
	 
	chain KUBE-SEP-KPHYKKPVMB473Z76  
		ip saddr 192.168.0.3  counter packets 0 bytes 0 jump KUBE-MARK-MASQ
		meta l4proto tcp   counter packets 0 bytes 0 dnat to 192.168.0.3:53
	 
	chain KUBE-SEP-FATPLMAF37ZNQP5P  
		ip saddr 192.168.0.3  counter packets 0 bytes 0 jump KUBE-MARK-MASQ
		meta l4proto tcp   counter packets 0 bytes 0 dnat to 192.168.0.3:9153
	 
	chain KUBE-SVC-666FUMINWJLRRQPD  
		 counter packets 1 bytes 60 jump KUBE-SEP-LYLDBZYLHY4MT3AQ
	 
	chain KUBE-SEP-LYLDBZYLHY4MT3AQ  
		ip saddr 192.168.0.4  counter packets 0 bytes 0 jump KUBE-MARK-MASQ
		meta l4proto tcp   counter packets 1 bytes 60 dnat to 192.168.0.4:8080
	 
	chain KUBE-XLB-EDNDUDH2C75GIR6O  
		 fib saddr type local counter packets 0 bytes 0 jump KUBE-MARK-MASQ
		 fib saddr type local counter packets 0 bytes 0 jump KUBE-SVC-EDNDUDH2C75GIR6O
		 counter packets 0 bytes 0 jump KUBE-SEP-BLQHCYCSXY3NRKLC
	 
	chain KUBE-XLB-CG5I4G2RS3ZVWGLK  
		 fib saddr type local counter packets 0 bytes 0 jump KUBE-MARK-MASQ
		 fib saddr type local counter packets 0 bytes 0 jump KUBE-SVC-CG5I4G2RS3ZVWGLK
		 counter packets 0 bytes 0 jump KUBE-SEP-5XVRKWM672JGTWXH
	 
	chain KUBE-SVC-EDNDUDH2C75GIR6O  
		 counter packets 0 bytes 0 jump KUBE-SEP-BLQHCYCSXY3NRKLC
	 
	chain KUBE-SEP-BLQHCYCSXY3NRKLC  
		ip saddr 192.168.0.5  counter packets 0 bytes 0 jump KUBE-MARK-MASQ
		meta l4proto tcp   counter packets 0 bytes 0 dnat to 192.168.0.5:443
	 
	chain KUBE-SVC-CG5I4G2RS3ZVWGLK  
		 counter packets 0 bytes 0 jump KUBE-SEP-5XVRKWM672JGTWXH
	 
	chain KUBE-SEP-5XVRKWM672JGTWXH  
		ip saddr 192.168.0.5  counter packets 0 bytes 0 jump KUBE-MARK-MASQ
		meta l4proto tcp   counter packets 0 bytes 0 dnat to 192.168.0.5:80
	 
	chain KUBE-SVC-EZYNCFY2F7N6OQA2  
		 counter packets 0 bytes 0 jump KUBE-SEP-JYW326XAJ4KK7QPG
	 
	chain KUBE-SEP-JYW326XAJ4KK7QPG  
		ip saddr 192.168.0.5  counter packets 0 bytes 0 jump KUBE-MARK-MASQ
		meta l4proto tcp   counter packets 0 bytes 0 dnat to 192.168.0.5:8443
	 
 
table ip filter  
	chain DOCKER  
	 
	chain DOCKER-ISOLATION-STAGE-1  
		iifname "docker0" oifname != "docker0" counter packets 0 bytes 0 jump DOCKER-ISOLATION-STAGE-2
		counter packets 0 bytes 0 return
	 
	chain DOCKER-ISOLATION-STAGE-2  
		oifname "docker0" counter packets 0 bytes 0 drop
		counter packets 0 bytes 0 return
	 
	chain FORWARD  
		type filter hook forward priority filter; policy drop;
		iifname "weave"  counter packets 213 bytes 54014 jump WEAVE-NPC-EGRESS
		oifname "weave"  counter packets 150 bytes 30038 jump WEAVE-NPC
		oifname "weave" ct state new counter packets 0 bytes 0 log group 86 
		oifname "weave" counter packets 0 bytes 0 drop
		iifname "weave" oifname != "weave" counter packets 33 bytes 2941 accept
		oifname "weave" ct state related,established counter packets 0 bytes 0 accept
		 counter packets 0 bytes 0 jump KUBE-FORWARD
		ct state new  counter packets 0 bytes 0 jump KUBE-SERVICES
		ct state new  counter packets 0 bytes 0 jump KUBE-EXTERNAL-SERVICES
		counter packets 0 bytes 0 jump DOCKER-USER
		counter packets 0 bytes 0 jump DOCKER-ISOLATION-STAGE-1
		oifname "docker0" ct state related,established counter packets 0 bytes 0 accept
		oifname "docker0" counter packets 0 bytes 0 jump DOCKER
		iifname "docker0" oifname != "docker0" counter packets 0 bytes 0 accept
		iifname "docker0" oifname "docker0" counter packets 0 bytes 0 accept
	 
	chain DOCKER-USER  
		counter packets 0 bytes 0 return
	 
	chain KUBE-FIREWALL  
		 mark and 0x8000 == 0x8000 counter packets 0 bytes 0 drop
		ip saddr != 127.0.0.0/8 ip daddr 127.0.0.0/8  ct status dnat counter packets 0 bytes 0 drop
	 
	chain OUTPUT  
		type filter hook output priority filter; policy accept;
		ct state new  counter packets 527014 bytes 31628984 jump KUBE-SERVICES
		counter packets 36324809 bytes 6021214027 jump KUBE-FIREWALL
		meta l4proto != esp  mark and 0x20000 == 0x20000 counter packets 0 bytes 0 drop
	 
	chain INPUT  
		type filter hook input priority filter; policy accept;
		 counter packets 35869492 bytes 5971008896 jump KUBE-NODEPORTS
		ct state new  counter packets 390938 bytes 23457377 jump KUBE-EXTERNAL-SERVICES
		counter packets 36249774 bytes 6030068622 jump KUBE-FIREWALL
		meta l4proto tcp ip daddr 127.0.0.1 tcp dport 6784 fib saddr type != local ct state != related,established  counter packets 0 bytes 0 drop
		iifname "weave" counter packets 907273 bytes 88697229 jump WEAVE-NPC-EGRESS
		counter packets 34809601 bytes 5818213726 jump WEAVE-IPSEC-IN
	 
	chain KUBE-KUBELET-CANARY  
	 
	chain KUBE-PROXY-CANARY  
	 
	chain KUBE-EXTERNAL-SERVICES  
	 
	chain KUBE-NODEPORTS  
		meta l4proto tcp  tcp dport 32196 counter packets 0 bytes 0 accept
		meta l4proto tcp  tcp dport 32196 counter packets 0 bytes 0 accept
	 
	chain KUBE-SERVICES  
	 
	chain KUBE-FORWARD  
		ct state invalid counter packets 0 bytes 0 drop
		 mark and 0x4000 == 0x4000 counter packets 0 bytes 0 accept
		 ct state related,established counter packets 0 bytes 0 accept
		 ct state related,established counter packets 0 bytes 0 accept
	 
	chain WEAVE-NPC-INGRESS  
	 
	chain WEAVE-NPC-DEFAULT  
		# match-set weave-;rGqyMIl1HN^cfDki~Z$3]6!N dst  counter packets 14 bytes 840 accept
		# match-set weave-P.B !ZhkAr5q=XZ?3 tMBA+0 dst  counter packets 0 bytes 0 accept
		# match-set weave-Rzff h:=]JaaJl/G;(XJpGjZ[ dst  counter packets 0 bytes 0 accept
		# match-set weave-]B*(W?)t*z5O17G044[gUo#$l dst  counter packets 0 bytes 0 accept
		# match-set weave-iLgO^ o=U/*%KE[@=W:l~ 9T dst  counter packets 9 bytes 540 accept
	 
	chain WEAVE-NPC  
		ct state related,established counter packets 124 bytes 28478 accept
		ip daddr 224.0.0.0/4 counter packets 0 bytes 0 accept
		# PHYSDEV match --physdev-out vethwe-bridge --physdev-is-bridged counter packets 3 bytes 180 accept
		ct state new counter packets 23 bytes 1380 jump WEAVE-NPC-DEFAULT
		ct state new counter packets 0 bytes 0 jump WEAVE-NPC-INGRESS
	 
	chain WEAVE-NPC-EGRESS-ACCEPT  
		counter packets 48 bytes 3769 meta mark set mark or 0x40000 
	 
	chain WEAVE-NPC-EGRESS-CUSTOM  
	 
	chain WEAVE-NPC-EGRESS-DEFAULT  
		# match-set weave-s_+ChJId4Uy_$ G;WdH ~TK)I src  counter packets 0 bytes 0 jump WEAVE-NPC-EGRESS-ACCEPT
		# match-set weave-s_+ChJId4Uy_$ G;WdH ~TK)I src  counter packets 0 bytes 0 return
		# match-set weave-E1ney4o[ojNrLk.6rOHi;7MPE src  counter packets 31 bytes 2749 jump WEAVE-NPC-EGRESS-ACCEPT
		# match-set weave-E1ney4o[ojNrLk.6rOHi;7MPE src  counter packets 31 bytes 2749 return
		# match-set weave-41s)5vQ^o/xWGz6a20N:~?# E src  counter packets 0 bytes 0 jump WEAVE-NPC-EGRESS-ACCEPT
		# match-set weave-41s)5vQ^o/xWGz6a20N:~?# E src  counter packets 0 bytes 0 return
		# match-set weave-sui%__gZ kX~oZgI_Ttqp=Dp src  counter packets 0 bytes 0 jump WEAVE-NPC-EGRESS-ACCEPT
		# match-set weave-sui%__gZ kX~oZgI_Ttqp=Dp src  counter packets 0 bytes 0 return
		# match-set weave-nmMUaDKV*YkQcP5s?Q[R54Ep3 src  counter packets 17 bytes 1020 jump WEAVE-NPC-EGRESS-ACCEPT
		# match-set weave-nmMUaDKV*YkQcP5s?Q[R54Ep3 src  counter packets 17 bytes 1020 return
	 
	chain WEAVE-NPC-EGRESS  
		ct state related,established counter packets 907425 bytes 88746642 accept
		# PHYSDEV match --physdev-in vethwe-bridge --physdev-is-bridged counter packets 0 bytes 0 return
		fib daddr type local counter packets 11 bytes 640 return
		ip daddr 224.0.0.0/4 counter packets 0 bytes 0 return
		ct state new counter packets 50 bytes 3961 jump WEAVE-NPC-EGRESS-DEFAULT
		ct state new mark and 0x40000 != 0x40000 counter packets 2 bytes 192 jump WEAVE-NPC-EGRESS-CUSTOM
	 
	chain WEAVE-IPSEC-IN  
	 
	chain WEAVE-CANARY  
	 
 
table ip mangle  
	chain KUBE-KUBELET-CANARY  
	 
	chain PREROUTING  
		type filter hook prerouting priority mangle; policy accept;
	 
	chain INPUT  
		type filter hook input priority mangle; policy accept;
		counter packets 35716863 bytes 5906910315 jump WEAVE-IPSEC-IN
	 
	chain FORWARD  
		type filter hook forward priority mangle; policy accept;
	 
	chain OUTPUT  
		type route hook output priority mangle; policy accept;
		counter packets 35804064 bytes 5938944956 jump WEAVE-IPSEC-OUT
	 
	chain POSTROUTING  
		type filter hook postrouting priority mangle; policy accept;
	 
	chain KUBE-PROXY-CANARY  
	 
	chain WEAVE-IPSEC-IN  
	 
	chain WEAVE-IPSEC-IN-MARK  
		counter packets 0 bytes 0 meta mark set mark or 0x20000
	 
	chain WEAVE-IPSEC-OUT  
	 
	chain WEAVE-IPSEC-OUT-MARK  
		counter packets 0 bytes 0 meta mark set mark or 0x20000
	 
	chain WEAVE-CANARY  
	 
 
Wow. That's a lot of nftables entries, but it explains what's going on. We have a nat entry for:
meta l4proto tcp ip daddr 10.107.66.138 tcp dport 8080 counter packets 1 bytes 60 jump KUBE-SVC-666FUMINWJLRRQPD
which ends up going to KUBE-SEP-LYLDBZYLHY4MT3AQ and:
meta l4proto tcp counter packets 1 bytes 60 dnat to 192.168.0.4:8080
So packets headed for our echoserver are eventually ending up in a container that has a local IP address of 192.168.0.4. Which we can see in our routing table via the weave interface. Mystery explained. We can see the ingress for the externally visible HTTP service as well:
meta l4proto tcp ip daddr 192.168.53.147 tcp dport 80 counter packets 0 bytes 0 jump KUBE-XLB-CG5I4G2RS3ZVWGLK
which ends up redirected to:
meta l4proto tcp counter packets 0 bytes 0 dnat to 192.168.0.5:80
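The chain-following done by hand above can also be sketched in Python. The excerpt below is copied from the nat table in the ruleset listing. The parser is a deliberate simplification written for this excerpt: it ignores all match conditions except the service address and port, and it is not a general nftables parser. The function names are made up for illustration.

```python
import re

# A few chains copied from the nat table listed above.
RULESET = """
chain KUBE-SERVICES
    meta l4proto tcp ip daddr 10.107.66.138  tcp dport 8080 counter packets 1 bytes 60 jump KUBE-SVC-666FUMINWJLRRQPD
    meta l4proto tcp ip daddr 192.168.53.147  tcp dport 80 counter packets 0 bytes 0 jump KUBE-XLB-CG5I4G2RS3ZVWGLK
chain KUBE-SVC-666FUMINWJLRRQPD
    counter packets 1 bytes 60 jump KUBE-SEP-LYLDBZYLHY4MT3AQ
chain KUBE-SEP-LYLDBZYLHY4MT3AQ
    ip saddr 192.168.0.4  counter packets 0 bytes 0 jump KUBE-MARK-MASQ
    meta l4proto tcp   counter packets 1 bytes 60 dnat to 192.168.0.4:8080
chain KUBE-MARK-MASQ
    counter packets 0 bytes 0 meta mark set mark or 0x4000
chain KUBE-XLB-CG5I4G2RS3ZVWGLK
    fib saddr type local counter packets 0 bytes 0 jump KUBE-MARK-MASQ
    fib saddr type local counter packets 0 bytes 0 jump KUBE-SVC-CG5I4G2RS3ZVWGLK
    counter packets 0 bytes 0 jump KUBE-SEP-5XVRKWM672JGTWXH
chain KUBE-SVC-CG5I4G2RS3ZVWGLK
    counter packets 0 bytes 0 jump KUBE-SEP-5XVRKWM672JGTWXH
chain KUBE-SEP-5XVRKWM672JGTWXH
    ip saddr 192.168.0.5  counter packets 0 bytes 0 jump KUBE-MARK-MASQ
    meta l4proto tcp   counter packets 0 bytes 0 dnat to 192.168.0.5:80
"""


def parse_chains(text):
    """Map each chain name to the list of its rule lines."""
    chains, current = {}, None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("chain "):
            current = line.split()[1]
            chains[current] = []
        elif line and current is not None:
            chains[current].append(line)
    return chains


def dnat_targets(chains, chain, seen=None):
    """Follow every jump from a chain and collect all 'dnat to' targets."""
    seen = set() if seen is None else seen
    if chain in seen or chain not in chains:
        return set()
    seen.add(chain)
    targets = set()
    for rule in chains[chain]:
        dnat = re.search(r"dnat to (\S+)", rule)
        if dnat:
            targets.add(dnat.group(1))
        jump = re.search(r"jump (\S+)", rule)
        if jump:
            targets |= dnat_targets(chains, jump.group(1), seen)
    return targets


def service_backends(chains, daddr, dport):
    """Find the KUBE-SERVICES rule for daddr:dport and resolve its backends."""
    pattern = re.compile(
        rf"ip daddr {re.escape(daddr)}\s+\S+ dport {dport}\b.*jump (\S+)")
    backends = set()
    for rule in chains.get("KUBE-SERVICES", []):
        match = pattern.search(rule)
        if match:
            backends |= dnat_targets(chains, match.group(1))
    return backends


if __name__ == "__main__":
    chains = parse_chains(RULESET)
    print(service_backends(chains, "10.107.66.138", 8080))  # {'192.168.0.4:8080'}
    print(service_backends(chains, "192.168.53.147", 80))   # {'192.168.0.5:80'}
```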
So from that we'd expect the IP inside the echoserver pod to be 192.168.0.4 and the IP address inside our nginx ingress pod to be 192.168.0.5. Let's look:
root@udon:/# docker ps | grep echoserver
7cbb177bee18   k8s.gcr.io/echoserver                 "/usr/local/bin/run. "   3 days ago   Up 3 days             k8s_echoserver_hello-node-59bffcc9fd-8hkgb_default_c7111c9e-7131-40e0-876d-be89d5ca1812_0
root@udon:/# docker exec -it 7cbb177bee18 /bin/bash
root@hello-node-59bffcc9fd-8hkgb:/# awk '/32 host/ { print f } {f=$2}' <<< "$(</proc/net/fib_trie)" | sort -u
127.0.0.1
192.168.0.4
It's a slightly awkward method of determining the local IP addresses due to the stripped down nature of the container, but it clearly shows the expected 192.168.0.4 address. I've touched here upon the ability to actually enter a container and have a poke around its running environment by using docker directly. Next step is to use that to investigate what containers have actually been spun up and what they're doing. I'll also revisit networking when I get to the point of building a multi-node cluster, to examine how the bridging between different hosts is done.
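The awk one-liner used inside the container is terse, so here is the same logic spelled out as a Python sketch: whenever a line contains "32 host", the second field of the line before it is a local address. The SAMPLE text is a shortened, made-up excerpt in the layout I understand /proc/net/fib_trie to have, not output captured from the pod.

```python
# Hypothetical, shortened excerpt in the layout of /proc/net/fib_trie.
SAMPLE = """\
Local:
  +-- 0.0.0.0/0 2 0 2
     +-- 127.0.0.0/8 2 0 2
        |-- 127.0.0.1
           /32 host LOCAL
     +-- 192.168.0.0/24 2 0 2
        |-- 192.168.0.4
           /32 host LOCAL
        |-- 192.168.0.255
           /32 link BROADCAST
"""


def local_addresses(fib_trie_text):
    """Return the sorted, de-duplicated /32 host addresses in the trie."""
    found = set()
    previous_field = ""
    for line in fib_trie_text.splitlines():
        if "32 host" in line:
            # Same as awk's "print f": f holds $2 of the previous line.
            found.add(previous_field)
        fields = line.split()
        previous_field = fields[1] if len(fields) > 1 else ""
    return sorted(found)


if __name__ == "__main__":
    print(local_addresses(SAMPLE))
```

On a real system the text would come from reading /proc/net/fib_trie, for example with open("/proc/net/fib_trie").read().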

9 April 2021

Michael Prokop: A Ceph war story

It all started with the big bang! We nearly lost 33 of 36 disks on a Proxmox/Ceph Cluster; this is the story of how we recovered them.
At the end of 2020, we eventually had a long outstanding maintenance window for taking care of system upgrades at a customer. During this maintenance window, which involved reboots of server systems, the involved Ceph cluster unexpectedly went into a critical state. What was planned to be a few hours of checklist work in the early evening turned out to be an emergency case; let's call it a nightmare (not only because it included a big part of the night). Since we have learned a few things from our post mortem and RCA, it's worth sharing those with others. But first things first, let's step back and clarify what we had to deal with.

The system and its upgrade
One part of the upgrade included 3 Debian servers (we're calling them server1, server2 and server3 here), running on Proxmox v5 + Debian/stretch with 12 Ceph OSDs each (65.45TB in total), a so-called Proxmox Hyper-Converged Ceph Cluster. First, we went for upgrading the Proxmox v5/stretch system to Proxmox v6/buster, before updating Ceph Luminous v12.2.13 to the latest v14.2 release, supported by Proxmox v6/buster. The Proxmox upgrade included updating corosync from v2 to v3. As part of this upgrade, we had to apply some configuration changes, like adjusting the ring0 + ring1 address settings and adding a mon_host configuration to the Ceph configuration.
During the first two servers' reboots, we noticed configuration glitches. After fixing those, we went for a reboot of the third server as well. Then we noticed that several Ceph OSDs were unexpectedly down. The NTP service wasn't working as expected after the upgrade. The underlying issue is a race condition of ntp with systemd-timesyncd (see #889290). As a result, we had clock skew problems with Ceph, indicating that the Ceph monitors' clocks weren't running in sync (which is essential for proper Ceph operation).
We initially assumed that our Ceph OSD failure derived from this clock skew problem, so we took care of it. After yet another round of reboots, to ensure that all systems were running with identical and sane configurations and services, we noticed lots of failing OSDs. This time all but three OSDs (19, 21 and 22) were down:
% sudo ceph osd tree
ID CLASS WEIGHT   TYPE NAME      STATUS REWEIGHT PRI-AFF
-1       65.44138 root default
-2       21.81310     host server1
 0   hdd  1.08989         osd.0    down  1.00000 1.00000
 1   hdd  1.08989         osd.1    down  1.00000 1.00000
 2   hdd  1.63539         osd.2    down  1.00000 1.00000
 3   hdd  1.63539         osd.3    down  1.00000 1.00000
 4   hdd  1.63539         osd.4    down  1.00000 1.00000
 5   hdd  1.63539         osd.5    down  1.00000 1.00000
18   hdd  2.18279         osd.18   down  1.00000 1.00000
20   hdd  2.18179         osd.20   down  1.00000 1.00000
28   hdd  2.18179         osd.28   down  1.00000 1.00000
29   hdd  2.18179         osd.29   down  1.00000 1.00000
30   hdd  2.18179         osd.30   down  1.00000 1.00000
31   hdd  2.18179         osd.31   down  1.00000 1.00000
-4       21.81409     host server2
 6   hdd  1.08989         osd.6    down  1.00000 1.00000
 7   hdd  1.08989         osd.7    down  1.00000 1.00000
 8   hdd  1.63539         osd.8    down  1.00000 1.00000
 9   hdd  1.63539         osd.9    down  1.00000 1.00000
10   hdd  1.63539         osd.10   down  1.00000 1.00000
11   hdd  1.63539         osd.11   down  1.00000 1.00000
19   hdd  2.18179         osd.19     up  1.00000 1.00000
21   hdd  2.18279         osd.21     up  1.00000 1.00000
22   hdd  2.18279         osd.22     up  1.00000 1.00000
32   hdd  2.18179         osd.32   down  1.00000 1.00000
33   hdd  2.18179         osd.33   down  1.00000 1.00000
34   hdd  2.18179         osd.34   down  1.00000 1.00000
-3       21.81419     host server3
12   hdd  1.08989         osd.12   down  1.00000 1.00000
13   hdd  1.08989         osd.13   down  1.00000 1.00000
14   hdd  1.63539         osd.14   down  1.00000 1.00000
15   hdd  1.63539         osd.15   down  1.00000 1.00000
16   hdd  1.63539         osd.16   down  1.00000 1.00000
17   hdd  1.63539         osd.17   down  1.00000 1.00000
23   hdd  2.18190         osd.23   down  1.00000 1.00000
24   hdd  2.18279         osd.24   down  1.00000 1.00000
25   hdd  2.18279         osd.25   down  1.00000 1.00000
35   hdd  2.18179         osd.35   down  1.00000 1.00000
36   hdd  2.18179         osd.36   down  1.00000 1.00000
37   hdd  2.18179         osd.37   down  1.00000 1.00000
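A listing like the one above can be tallied per host with a few lines of Python. This is only a sketch for the plain-text layout shown here: the SAMPLE is an abbreviated excerpt of that listing, the function name is made up, and for anything serious a machine-readable output format would be a better starting point.

```python
# Abbreviated excerpt of the "ceph osd tree" listing above.
SAMPLE = """\
ID CLASS WEIGHT   TYPE NAME      STATUS REWEIGHT PRI-AFF
-1       65.44138 root default
-2       21.81310     host server1
 0   hdd  1.08989         osd.0    down  1.00000 1.00000
 1   hdd  1.08989         osd.1    down  1.00000 1.00000
-4       21.81409     host server2
 6   hdd  1.08989         osd.6    down  1.00000 1.00000
19   hdd  2.18179         osd.19     up  1.00000 1.00000
"""


def tally_osds(tree_text):
    """Count OSD states per host from plain-text 'ceph osd tree' output."""
    counts = {}
    host = None
    for line in tree_text.splitlines():
        fields = line.split()
        if "host" in fields:
            # A host line: remember its name for the OSD lines that follow.
            host = fields[fields.index("host") + 1]
            counts[host] = {"up": 0, "down": 0}
            continue
        osd_positions = [i for i, f in enumerate(fields) if f.startswith("osd.")]
        if host is not None and osd_positions:
            # The STATUS column directly follows the osd.N name.
            status = fields[osd_positions[0] + 1]
            counts[host][status] = counts[host].get(status, 0) + 1
    return counts


if __name__ == "__main__":
    for host, states in tally_osds(SAMPLE).items():
        print(host, states)
```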
Our blood pressure increased slightly! Did we just lose all of our cluster? What happened, and how can we get all the other OSDs back? We stumbled upon this beauty in our logs:
kernel: [   73.697957] XFS (sdl1): SB stripe unit sanity check failed
kernel: [   73.698002] XFS (sdl1): Metadata corruption detected at xfs_sb_read_verify+0x10e/0x180 [xfs], xfs_sb block 0xffffffffffffffff
kernel: [   73.698799] XFS (sdl1): Unmount and run xfs_repair
kernel: [   73.699199] XFS (sdl1): First 128 bytes of corrupted metadata buffer:
kernel: [   73.699677] 00000000: 58 46 53 42 00 00 10 00 00 00 00 00 00 00 62 00  XFSB..........b.
kernel: [   73.700205] 00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
kernel: [   73.700836] 00000020: 62 44 2b c0 e6 22 40 d7 84 3d e1 cc 65 88 e9 d8  bD+.."@..=..e...
kernel: [   73.701347] 00000030: 00 00 00 00 00 00 40 08 00 00 00 00 00 00 01 00  ......@.........
kernel: [   73.701770] 00000040: 00 00 00 00 00 00 01 01 00 00 00 00 00 00 01 02  ................
ceph-disk[4240]: mount: /var/lib/ceph/tmp/mnt.jw367Y: mount(2) system call failed: Structure needs cleaning.
ceph-disk[4240]: ceph-disk: Mounting filesystem failed: Command '['/bin/mount', '-t', u'xfs', '-o', 'noatime,inode64', '--', '/dev/disk/by-parttypeuuid/4fbd7e29-9d25-41b8-afd0-062c0ceff05d.cdda39ed-5
ceph/tmp/mnt.jw367Y']' returned non-zero exit status 32
kernel: [   73.702162] 00000050: 00 00 00 01 00 00 18 80 00 00 00 04 00 00 00 00  ................
kernel: [   73.702550] 00000060: 00 00 06 48 bd a5 10 00 08 00 00 02 00 00 00 00  ...H............
kernel: [   73.702975] 00000070: 00 00 00 00 00 00 00 00 0c 0c 0b 01 0d 00 00 19  ................
kernel: [   73.703373] XFS (sdl1): SB validate failed with error -117.
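The first 128 bytes of the metadata buffer printed in the log above can be decoded by hand. The following is a minimal Python sketch, assuming the big-endian field order of the XFS on-disk superblock as I understand it. The field names are shortened from the kernel's sb_* names, and the bytes are copied from the kernel lines above, skipping the interleaved ceph-disk lines.

```python
import struct
import uuid

# Hex bytes copied from the "corrupted metadata buffer" kernel lines above.
DUMP = bytes.fromhex("""
58 46 53 42 00 00 10 00 00 00 00 00 00 00 62 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
62 44 2b c0 e6 22 40 d7 84 3d e1 cc 65 88 e9 d8
00 00 00 00 00 00 40 08 00 00 00 00 00 00 01 00
00 00 00 00 00 00 01 01 00 00 00 00 00 00 01 02
00 00 00 01 00 00 18 80 00 00 00 04 00 00 00 00
00 00 06 48 bd a5 10 00 08 00 00 02 00 00 00 00
00 00 00 00 00 00 00 00 0c 0c 0b 01 0d 00 00 19
""")

# Assumed layout of the first 128 bytes of the superblock (big-endian).
LAYOUT = ">4sIQQQ16sQQQQIIIIIHHHH12sBBBBBBBB"
FIELDS = (
    "magic blocksize dblocks rblocks rextents uuid logstart rootino "
    "rbmino rsumino rextsize agblocks agcount rbmblocks logblocks "
    "versionnum sectsize inodesize inopblock fname blocklog sectlog "
    "inodelog inopblog agblklog rextslog inprogress imax_pct"
).split()


def decode_superblock(raw):
    """Decode the leading superblock fields into a dict."""
    values = struct.unpack(LAYOUT, raw[:struct.calcsize(LAYOUT)])
    sb = dict(zip(FIELDS, values))
    sb["uuid"] = str(uuid.UUID(bytes=sb["uuid"]))
    return sb


if __name__ == "__main__":
    sb = decode_superblock(DUMP)
    for name in ("magic", "blocksize", "dblocks", "agblocks", "agcount",
                 "logblocks", "sectsize", "inodesize", "uuid"):
        print(f"{name:10} {sb[name]}")
```

Under these assumptions the visible part of the superblock decodes to a plausible geometry: 4096-byte blocks, 25088 data blocks, 4 allocation groups of 6272 blocks and a 1608-block log. To my understanding the stripe unit and stripe width fields sit further into the superblock than the 128 bytes the kernel prints, so the values that trip the sanity check are not visible in this dump.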
The same issue was present for the other failing OSDs. We hoped that the data itself was still there, and that only the mounting of the XFS partitions failed. The Ceph cluster was initially installed in 2017 with Ceph jewel/10.2 with the OSDs on filestore (nowadays a legacy approach to storing objects in Ceph). However, we have since migrated the disks to bluestore (with ceph-disk and not yet via ceph-volume, which is what's used nowadays). Using ceph-disk introduces these 100MB XFS partitions containing basic metadata for the OSD. Given that we had three working OSDs left, we decided to investigate how to rebuild the failing ones. Some folks on #ceph (thanks T1, ormandj + peetaur!) were kind enough to share what working XFS partitions looked like for them. After creating a backup (via dd), we tried to re-create such an XFS partition on server1. We noticed that even mounting a freshly created XFS partition failed:
synpromika@server1 ~ % sudo mkfs.xfs -f -i size=2048 -m uuid="4568c300-ad83-4288-963e-badcd99bf54f" /dev/sdc1
meta-data=/dev/sdc1              isize=2048   agcount=4, agsize=6272 blks
         =                       sectsz=4096  attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=1, rmapbt=0
         =                       reflink=0
data     =                       bsize=4096   blocks=25088, imaxpct=25
         =                       sunit=128    swidth=64 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
log      =internal log           bsize=4096   blocks=1608, version=2
         =                       sectsz=4096  sunit=1 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
synpromika@server1 ~ % sudo mount /dev/sdc1 /mnt/ceph-recovery
SB stripe unit sanity check failed
Metadata corruption detected at 0x433840, xfs_sb block 0x0/0x1000
libxfs_writebufr: write verifer failed on xfs_sb bno 0x0/0x1000
cache_node_purge: refcount was 1, not zero (node=0x1d3c400)
SB stripe unit sanity check failed
Metadata corruption detected at 0x433840, xfs_sb block 0x18800/0x1000
libxfs_writebufr: write verifer failed on xfs_sb bno 0x18800/0x1000
SB stripe unit sanity check failed
Metadata corruption detected at 0x433840, xfs_sb block 0x0/0x1000
libxfs_writebufr: write verifer failed on xfs_sb bno 0x0/0x1000
SB stripe unit sanity check failed
Metadata corruption detected at 0x433840, xfs_sb block 0x24c00/0x1000
libxfs_writebufr: write verifer failed on xfs_sb bno 0x24c00/0x1000
SB stripe unit sanity check failed
Metadata corruption detected at 0x433840, xfs_sb block 0xc400/0x1000
libxfs_writebufr: write verifer failed on xfs_sb bno 0xc400/0x1000
releasing dirty buffer (bulk) to free list!releasing dirty buffer (bulk) to free list!releasing dirty buffer (bulk) to free list!releasing dirty buffer (bulk) to free list!found dirty buffer (bulk) on free list!bad magic number
bad magic number
Metadata corruption detected at 0x433840, xfs_sb block 0x0/0x1000
libxfs_writebufr: write verifer failed on xfs_sb bno 0x0/0x1000
releasing dirty buffer (bulk) to free list!mount: /mnt/ceph-recovery: wrong fs type, bad option, bad superblock on /dev/sdc1, missing codepage or helper program, or other error.
Ouch. This very much looked related to the actual issue we're seeing. So we tried to execute mkfs.xfs with a bunch of different sunit/swidth settings. Using -d sunit=512 -d swidth=512 at least worked, so we decided to force its usage in the creation of our OSD XFS partition. This brought us a working XFS partition. Please note that sunit must not be larger than swidth (more on that later!). Then we reconstructed how to restore all the metadata for the OSD (activate.monmap, active, block_uuid, bluefs, ceph_fsid, fsid, keyring, kv_backend, magic, mkfs_done, ready, require_osd_release, systemd, type, whoami). To identify the UUID, we can read the data from ceph --format json osd dump, like this for all our OSDs (Zsh syntax ftw!):
synpromika@server1 ~ % for f in {0..37} ; printf "osd-$f: %s\n" "$(sudo ceph --format json osd dump | jq -r ".osds[] | select(.osd==$f) | .uuid")"
osd-0: 4568c300-ad83-4288-963e-badcd99bf54f
osd-1: e573a17a-ccde-4719-bdf8-eef66903ca4f
osd-2: 0e1b2626-f248-4e7d-9950-f1a46644754e
osd-3: 1ac6a0a2-20ee-4ed8-9f76-d24e900c800c
[...]
Identifying the corresponding raw device for each OSD UUID is possible via:
synpromika@server1 ~ % UUID="4568c300-ad83-4288-963e-badcd99bf54f"
synpromika@server1 ~ % readlink -f /dev/disk/by-partuuid/"${UUID}"
/dev/sdc1
The OSD's key ID can be retrieved via:
synpromika@server1 ~ % OSD_ID=0
synpromika@server1 ~ % sudo ceph auth get osd."${OSD_ID}" -f json 2>/dev/null | jq -r '.[] | .key'
AQCKFpZdm0We[...]
Now we also need to identify the underlying block device:
synpromika@server1 ~ % OSD_ID=0
synpromika@server1 ~ % sudo ceph osd metadata osd."${OSD_ID}" -f json | jq -r '.bluestore_bdev_partition_path'
/dev/sdc2
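The values gathered above end up in a handful of small files on the metadata partition. The following is a minimal sketch of how they could be written out. It assumes the file layout of a typical ceph-disk bluestore OSD; the function name, its arguments and the exact file contents are assumptions and not the script used here, so compare against a known-good OSD first.

```shell
# Sketch only: write the per-OSD files into an already mounted metadata
# partition. All names and file formats below are ASSUMPTIONS; verify them
# against a working OSD before use.
# Usage: write_osd_metadata DIR OSD_ID OSD_UUID OSD_KEY BLOCK_PARTUUID
write_osd_metadata() {
    dir=$1 osd_id=$2 osd_uuid=$3 osd_key=$4 block_uuid=$5

    printf '%s\n' "$osd_id"     > "$dir/whoami"
    printf '%s\n' "$osd_uuid"   > "$dir/fsid"
    printf '%s\n' "$block_uuid" > "$dir/block_uuid"

    # keyring in the INI-style layout Ceph keyrings commonly use (assumed)
    printf '[osd.%s]\n\tkey = %s\n' "$osd_id" "$osd_key" > "$dir/keyring"
    chmod 600 "$dir/keyring"

    # block is assumed to be a symlink to the bluestore data partition
    rm -f "$dir/block"
    ln -s "/dev/disk/by-partuuid/$block_uuid" "$dir/block"
}
```

Called as `write_osd_metadata /mnt/ceph-recovery 0 "$UUID" "$KEY" "$BLOCK_UUID"`, it only touches the five OSD-specific files; the remaining files would be copied unchanged from a working OSD.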
With all of this, we reconstructed the keyring, fsid, whoami, block + block_uuid files. All the other files inside the XFS metadata partition are identical on each OSD. So after placing and adjusting the corresponding metadata on the XFS partition for Ceph usage, we got a working OSD, hurray! Since we had to fix yet another 32 OSDs, we decided to automate this XFS partitioning and metadata recovery procedure. We had a network share available on /srv/backup for storing backups of existing partition data. On each server, we tested the procedure with one single OSD before iterating over the list of remaining failing OSDs. We started with a shell script on server1, then adjusted the script for server2 and server3. This is the script, as we executed it on the 3rd server. Thanks to this, we managed to get the Ceph cluster up and running again. We didn't want to continue with the Ceph upgrade itself during the night though, as we wanted to know exactly what was going on and why the system behaved like that. Time for RCA!

Root Cause Analysis

So all OSDs failed except for three on server2, and the problem seems to be related to XFS. Therefore, our starting point for the RCA was to identify what was different on server2, as compared to server1 + server3. My initial assumption was that this was related to some firmware issues with the involved controller (and as it turned out later, I was right!). The disks were attached as JBOD devices to a ServeRAID M5210 controller (with a stripe size of 512). Firmware state:
synpromika@server1 ~ % sudo storcli64 /c0 show all | grep '^Firmware'
Firmware Package Build = 24.16.0-0092
Firmware Version = 4.660.00-8156
synpromika@server2 ~ % sudo storcli64 /c0 show all | grep '^Firmware'
Firmware Package Build = 24.21.0-0112
Firmware Version = 4.680.00-8489
synpromika@server3 ~ % sudo storcli64 /c0 show all | grep '^Firmware'
Firmware Package Build = 24.16.0-0092
Firmware Version = 4.660.00-8156
This looked very promising, as server2 indeed runs with a different firmware version on the controller. But how so? Well, the motherboard of server2 got replaced by a Lenovo/IBM technician in January 2020, as we had a failing memory slot during a memory upgrade. As part of this procedure, the Lenovo/IBM technician installed the latest firmware versions. According to our documentation, some OSDs were rebuilt (due to the filestore->bluestore migration) in March and April 2020. It turned out that precisely those OSDs were the ones that survived the upgrade. So the surviving drives were created with a different firmware version running on the involved controller, while all the other OSDs were created with an older controller firmware. But what difference does this make? Let's check the firmware changelogs. For the 24.21.0-0097 release we found this:
- Cannot create or mount xfs filesystem using xfsprogs 4.19.x kernel 4.20(SCGCQ02027889)
- xfs_info command run on an XFS file system created on a VD of strip size 1M shows sunit and swidth as 0(SCGCQ02056038)
Our XFS problem certainly was related to the controller's firmware. We also recalled that our monitoring system reported different sunit settings for the OSDs that were rebuilt in March and April. For example, OSD 21 was recreated and got different sunit settings:
WARN  server2.example.org  Mount options of /var/lib/ceph/osd/ceph-21      WARN - Missing: sunit=1024, Exceeding: sunit=512
We compared the new OSD 21 with an existing one (OSD 25 on server3):
synpromika@server2 ~ % systemctl show var-lib-ceph-osd-ceph\\x2d21.mount | grep sunit
Options=rw,noatime,attr2,inode64,sunit=512,swidth=512,noquota
synpromika@server3 ~ % systemctl show var-lib-ceph-osd-ceph\\x2d25.mount | grep sunit
Options=rw,noatime,attr2,inode64,sunit=1024,swidth=512,noquota
Thanks to our documentation, we could compare execution logs of their creation:
% diff -u ceph-disk-osd-25.log ceph-disk-osd-21.log
-synpromika@server2 ~ % sudo ceph-disk -v prepare --bluestore /dev/sdj --osd-id 25
+synpromika@server3 ~ % sudo ceph-disk -v prepare --bluestore /dev/sdi --osd-id 21
[...]
-command_check_call: Running command: /sbin/mkfs -t xfs -f -i size=2048 -- /dev/sdj1
-meta-data=/dev/sdj1              isize=2048   agcount=4, agsize=6272 blks
[...]
+command_check_call: Running command: /sbin/mkfs -t xfs -f -i size=2048 -- /dev/sdi1
+meta-data=/dev/sdi1              isize=2048   agcount=4, agsize=6336 blks
          =                       sectsz=4096  attr=2, projid32bit=1
          =                       crc=1        finobt=1, sparse=0, rmapbt=0, reflink=0
-data     =                       bsize=4096   blocks=25088, imaxpct=25
-         =                       sunit=128    swidth=64 blks
+data     =                       bsize=4096   blocks=25344, imaxpct=25
+         =                       sunit=64     swidth=64 blks
 naming   =version 2              bsize=4096   ascii-ci=0 ftype=1
 log      =internal log           bsize=4096   blocks=1608, version=2
          =                       sectsz=4096  sunit=1 blks, lazy-count=1
 realtime =none                   extsz=4096   blocks=0, rtextents=0
[...]
So back then, we even tried to track this down but couldn't make sense of it yet. But now this sounds very much like it is related to the problem we saw with this Ceph/XFS failure. We follow Occam's razor, assuming the simplest explanation is usually the right one, so let's check the disk properties and see what differs:
synpromika@server1 ~ % sudo blockdev --getsz --getsize64 --getss --getpbsz --getiomin --getioopt /dev/sdk
4685545472
2398999281664
512
4096
524288
262144
synpromika@server2 ~ % sudo blockdev --getsz --getsize64 --getss --getpbsz --getiomin --getioopt /dev/sdk
4685545472
2398999281664
512
4096
262144
262144
See the difference between server1 and server2 for identical disks? The getiomin option now reports something different for them:
synpromika@server1 ~ % sudo blockdev --getiomin /dev/sdk            
524288
synpromika@server1 ~ % cat /sys/block/sdk/queue/minimum_io_size
524288
synpromika@server2 ~ % sudo blockdev --getiomin /dev/sdk 
262144
synpromika@server2 ~ % cat /sys/block/sdk/queue/minimum_io_size
262144
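To relate these byte values to the sunit/swidth figures seen earlier, here is a small sketch of the arithmetic. It assumes, as noted further on in this post, that mkfs takes the stripe unit from the minimum I/O size and the stripe width from the optimal I/O size; the function name is made up for illustration.

```shell
# Convert a device's I/O hints (in bytes) into the two unit systems used here:
#   - filesystem blocks, as printed by mkfs.xfs/xfs_info ("sunit=... blks")
#   - 512-byte units, as used by the sunit=/swidth= mount options
# Usage: stripe_from_hints MIN_IO_BYTES OPT_IO_BYTES [BLOCK_SIZE_BYTES]
stripe_from_hints() {
    min_io=$1 opt_io=$2 bsize=${3:-4096}
    echo "blks: sunit=$((min_io / bsize)) swidth=$((opt_io / bsize))"
    echo "mount: sunit=$((min_io / 512)) swidth=$((opt_io / 512))"
}
```

With the server1 values (`stripe_from_hints 524288 262144`) this yields sunit=128/swidth=64 in blocks and sunit=1024/swidth=512 in mount units, which are the figures reported for the old OSDs; the server2 values give 64/64 and 512/512.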
It doesn't make sense that the minimum I/O size (iomin, AKA BLKIOMIN) is bigger than the optimal I/O size (ioopt, AKA BLKIOOPT). This leads us to Bug 202127 (cannot mount or create xfs on a 597T device), which matches our findings here. But why did this XFS partition work in the past, and why does it fail now with the newer kernel version?

The XFS behaviour change

Now, given that we have backups of all the XFS partitions, we wanted to track down a) when this XFS behaviour was introduced, and b) whether, and if so how, it would be possible to reuse the XFS partition without having to rebuild it from scratch (e.g. if you had no working Ceph OSD or backups left). Let's look at such a failing XFS partition with the Grml live system:
root@grml ~ # grml-version
grml64-full 2020.06 Release Codename Ausgehfuahangl [2020-06-24]
root@grml ~ # uname -a
Linux grml 5.6.0-2-amd64 #1 SMP Debian 5.6.14-2 (2020-06-09) x86_64 GNU/Linux
root@grml ~ # grml-hostname grml-2020-06
Setting hostname to grml-2020-06: done
root@grml ~ # exec zsh
root@grml-2020-06 ~ # dpkg -l xfsprogs util-linux
Desired=Unknown/Install/Remove/Purge/Hold
  Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
 / Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
 / Name           Version      Architecture Description
+++-==============-============-============-=========================================
ii  util-linux     2.35.2-4     amd64        miscellaneous system utilities
ii  xfsprogs       5.6.0-1+b2   amd64        Utilities for managing the XFS filesystem
There it's failing, no matter which mount option we try:
root@grml-2020-06 ~ # mount ./sdd1.dd /mnt
mount: /mnt: mount(2) system call failed: Structure needs cleaning.
root@grml-2020-06 ~ # dmesg | tail -30
[...]
[   64.788640] XFS (loop1): SB stripe unit sanity check failed
[   64.788671] XFS (loop1): Metadata corruption detected at xfs_sb_read_verify+0x102/0x170 [xfs], xfs_sb block 0xffffffffffffffff
[   64.788671] XFS (loop1): Unmount and run xfs_repair
[   64.788672] XFS (loop1): First 128 bytes of corrupted metadata buffer:
[   64.788673] 00000000: 58 46 53 42 00 00 10 00 00 00 00 00 00 00 62 00  XFSB..........b.
[   64.788674] 00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[   64.788675] 00000020: 32 b6 dc 35 53 b7 44 96 9d 63 30 ab b3 2b 68 36  2..5S.D..c0..+h6
[   64.788675] 00000030: 00 00 00 00 00 00 40 08 00 00 00 00 00 00 01 00  ......@.........
[   64.788675] 00000040: 00 00 00 00 00 00 01 01 00 00 00 00 00 00 01 02  ................
[   64.788676] 00000050: 00 00 00 01 00 00 18 80 00 00 00 04 00 00 00 00  ................
[   64.788677] 00000060: 00 00 06 48 bd a5 10 00 08 00 00 02 00 00 00 00  ...H............
[   64.788677] 00000070: 00 00 00 00 00 00 00 00 0c 0c 0b 01 0d 00 00 19  ................
[   64.788679] XFS (loop1): SB validate failed with error -117.
root@grml-2020-06 ~ # mount -t xfs -o rw,relatime,attr2,inode64,sunit=1024,swidth=512,noquota ./sdd1.dd /mnt/
mount: /mnt: wrong fs type, bad option, bad superblock on /dev/loop1, missing codepage or helper program, or other error.
32 root@grml-2020-06 ~ # dmesg | tail -1
[   66.342976] XFS (loop1): stripe width (512) must be a multiple of the stripe unit (1024)
root@grml-2020-06 ~ # mount -t xfs -o rw,relatime,attr2,inode64,sunit=512,swidth=512,noquota ./sdd1.dd /mnt/
mount: /mnt: mount(2) system call failed: Structure needs cleaning.
32 root@grml-2020-06 ~ # dmesg | tail -14
[   66.342976] XFS (loop1): stripe width (512) must be a multiple of the stripe unit (1024)
[   80.751277] XFS (loop1): SB stripe unit sanity check failed
[   80.751323] XFS (loop1): Metadata corruption detected at xfs_sb_read_verify+0x102/0x170 [xfs], xfs_sb block 0xffffffffffffffff 
[   80.751324] XFS (loop1): Unmount and run xfs_repair
[   80.751325] XFS (loop1): First 128 bytes of corrupted metadata buffer:
[   80.751327] 00000000: 58 46 53 42 00 00 10 00 00 00 00 00 00 00 62 00  XFSB..........b.
[   80.751328] 00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[   80.751330] 00000020: 32 b6 dc 35 53 b7 44 96 9d 63 30 ab b3 2b 68 36  2..5S.D..c0..+h6
[   80.751331] 00000030: 00 00 00 00 00 00 40 08 00 00 00 00 00 00 01 00  ......@.........
[   80.751331] 00000040: 00 00 00 00 00 00 01 01 00 00 00 00 00 00 01 02  ................
[   80.751332] 00000050: 00 00 00 01 00 00 18 80 00 00 00 04 00 00 00 00  ................
[   80.751333] 00000060: 00 00 06 48 bd a5 10 00 08 00 00 02 00 00 00 00  ...H............
[   80.751334] 00000070: 00 00 00 00 00 00 00 00 0c 0c 0b 01 0d 00 00 19  ................
[   80.751338] XFS (loop1): SB validate failed with error -117.
xfs_repair doesn't help either:
root@grml-2020-06 ~ # xfs_info ./sdd1.dd
meta-data=./sdd1.dd              isize=2048   agcount=4, agsize=6272 blks
         =                       sectsz=4096  attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=0, rmapbt=0
         =                       reflink=0
data     =                       bsize=4096   blocks=25088, imaxpct=25
         =                       sunit=128    swidth=64 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
log      =internal log           bsize=4096   blocks=1608, version=2
         =                       sectsz=4096  sunit=1 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
root@grml-2020-06 ~ # xfs_repair ./sdd1.dd
Phase 1 - find and verify superblock...
bad primary superblock - bad stripe width in superblock !!!
attempting to find secondary superblock...
..............................................................................................Sorry, could not find valid secondary superblock
Exiting now.
With the SB stripe unit sanity check failed message, we could easily track this down to the following commit fa4ca9c:
% git show fa4ca9c5574605d1e48b7e617705230a0640b6da | cat
commit fa4ca9c5574605d1e48b7e617705230a0640b6da
Author: Dave Chinner <dchinner@redhat.com>
Date:   Tue Jun 5 10:06:16 2018 -0700
    
    xfs: catch bad stripe alignment configurations
    
    When stripe alignments are invalid, data alignment algorithms in the
    allocator may not work correctly. Ensure we catch superblocks with
    invalid stripe alignment setups at mount time. These data alignment
    mismatches are now detected at mount time like this:
    
    XFS (loop0): SB stripe unit sanity check failed
    XFS (loop0): Metadata corruption detected at xfs_sb_read_verify+0xab/0x110, xfs_sb block 0xffffffffffffffff
    XFS (loop0): Unmount and run xfs_repair
    XFS (loop0): First 128 bytes of corrupted metadata buffer:
    0000000091c2de02: 58 46 53 42 00 00 10 00 00 00 00 00 00 00 10 00  XFSB............
    0000000023bff869: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
    00000000cdd8c893: 17 32 37 15 ff ca 46 3d 9a 17 d3 33 04 b5 f1 a2  .27...F=...3....
    000000009fd2844f: 00 00 00 00 00 00 00 04 00 00 00 00 00 00 06 d0  ................
    0000000088e9b0bb: 00 00 00 00 00 00 06 d1 00 00 00 00 00 00 06 d2  ................
    00000000ff233a20: 00 00 00 01 00 00 10 00 00 00 00 01 00 00 00 00  ................
    000000009db0ac8b: 00 00 03 60 e1 34 02 00 08 00 00 02 00 00 00 00  ... .4..........
    00000000f7022460: 00 00 00 00 00 00 00 00 0c 09 0b 01 0c 00 00 19  ................
    XFS (loop0): SB validate failed with error -117.
    
    And the mount fails.
    
    Signed-off-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
    Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
    Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
diff --git fs/xfs/libxfs/xfs_sb.c fs/xfs/libxfs/xfs_sb.c
index b5dca3c8c84d..c06b6fc92966 100644
--- fs/xfs/libxfs/xfs_sb.c
+++ fs/xfs/libxfs/xfs_sb.c
@@ -278,6 +278,22 @@ xfs_mount_validate_sb(
                return -EFSCORRUPTED;
        }
 
+       if (sbp->sb_unit) {
+               if (!xfs_sb_version_hasdalign(sbp) ||
+                   sbp->sb_unit > sbp->sb_width ||
+                   (sbp->sb_width % sbp->sb_unit) != 0) {
+                       xfs_notice(mp, "SB stripe unit sanity check failed");
+                       return -EFSCORRUPTED;
+               }
+       } else if (xfs_sb_version_hasdalign(sbp)) {
+               xfs_notice(mp, "SB stripe alignment sanity check failed");
+               return -EFSCORRUPTED;
+       } else if (sbp->sb_width) {
+               xfs_notice(mp, "SB stripe width sanity check failed");
+               return -EFSCORRUPTED;
+       }
+
+
        if (xfs_sb_version_hascrc(&mp->m_sb) &&
            sbp->sb_blocksize < XFS_MIN_CRC_BLOCKSIZE) {
                xfs_notice(mp, "v5 SB sanity check failed");
This change is included in kernel versions 4.18-rc1 and newer:
% git describe --contains fa4ca9c5574605d1e48
v4.18-rc1~37^2~14
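The check added by that commit can be paraphrased in a few lines of shell. This is only a rough sketch for reasoning about sunit/swidth pairs: it ignores the superblock's dalign feature bit, which the kernel code also looks at, and the function name is made up.

```shell
# Rough paraphrase of the stripe sanity check from the commit above.
# Usage: stripe_geometry_ok SUNIT SWIDTH   (both in the same unit)
# Returns 0 if the pair would pass, non-zero otherwise.
stripe_geometry_ok() {
    sunit=$1 swidth=$2
    if [ "$sunit" -ne 0 ]; then
        # unit must not exceed width, and width must be a multiple of unit
        [ "$sunit" -le "$swidth" ] && [ $((swidth % sunit)) -eq 0 ]
    else
        # no stripe unit set: a non-zero width is inconsistent
        [ "$swidth" -eq 0 ]
    fi
}
```

`stripe_geometry_ok 128 64` fails, matching the sunit=128/swidth=64 superblock that newer kernels reject, while `stripe_geometry_ok 64 64` passes.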
Now let's try with an older kernel version (4.9.0), using the old Grml 2017.05 release:
root@grml ~ # grml-version
grml64-small 2017.05 Release Codename Freedatensuppe [2017-05-31]
root@grml ~ # uname -a
Linux grml 4.9.0-1-grml-amd64 #1 SMP Debian 4.9.29-1+grml.1 (2017-05-24) x86_64 GNU/Linux
root@grml ~ # lsb_release -a
No LSB modules are available.
Distributor ID: Debian
Description:    Debian GNU/Linux 9.0 (stretch)
Release:        9.0
Codename:       stretch
root@grml ~ # grml-hostname grml-2017-05
Setting hostname to grml-2017-05: done
root@grml ~ # exec zsh
root@grml-2017-05 ~ #
root@grml-2017-05 ~ # xfs_info ./sdd1.dd
xfs_info: ./sdd1.dd is not a mounted XFS filesystem
1 root@grml-2017-05 ~ # xfs_repair ./sdd1.dd
Phase 1 - find and verify superblock...
bad primary superblock - bad stripe width in superblock !!!
attempting to find secondary superblock...
..............................................................................................Sorry, could not find valid secondary superblock
Exiting now.
1 root@grml-2017-05 ~ # mount ./sdd1.dd /mnt
root@grml-2017-05 ~ # mount -t xfs
/root/sdd1.dd on /mnt type xfs (rw,relatime,attr2,inode64,sunit=1024,swidth=512,noquota)
root@grml-2017-05 ~ # ls /mnt
activate.monmap  active  block  block_uuid  bluefs  ceph_fsid  fsid  keyring  kv_backend  magic  mkfs_done  ready  require_osd_release  systemd  type  whoami
root@grml-2017-05 ~ # xfs_info /mnt
meta-data=/dev/loop1             isize=2048   agcount=4, agsize=6272 blks
         =                       sectsz=4096  attr=2, projid32bit=1
         =                       crc=1        finobt=1 spinodes=0 rmapbt=0
         =                       reflink=0
data     =                       bsize=4096   blocks=25088, imaxpct=25
         =                       sunit=128    swidth=64 blks
naming   =version 2              bsize=4096   ascii-ci=0 ftype=1
log      =internal               bsize=4096   blocks=1608, version=2
         =                       sectsz=4096  sunit=1 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
Mounting there indeed works! Now, if we mount the filesystem with new and proper sunit/swidth settings using the older kernel, it should rewrite them on disk:
root@grml-2017-05 ~ # mount -t xfs -o sunit=512,swidth=512 ./sdd1.dd /mnt/
root@grml-2017-05 ~ # umount /mnt/
And indeed, mounting this rewritten filesystem then also works with newer kernels:
root@grml-2020-06 ~ # mount ./sdd1.rewritten /mnt/
root@grml-2020-06 ~ # xfs_info /root/sdd1.rewritten
meta-data=/dev/loop1             isize=2048   agcount=4, agsize=6272 blks
         =                       sectsz=4096  attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=0, rmapbt=0
         =                       reflink=0
data     =                       bsize=4096   blocks=25088, imaxpct=25
         =                       sunit=64    swidth=64 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
log      =internal log           bsize=4096   blocks=1608, version=2
         =                       sectsz=4096  sunit=1 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
root@grml-2020-06 ~ # mount -t xfs                
/root/sdd1.rewritten on /mnt type xfs (rw,relatime,attr2,inode64,logbufs=8,logbsize=32k,sunit=512,swidth=512,noquota)
FTR: The sunit=512,swidth=512 from the XFS mount options is identical to xfs_info's output sunit=64,swidth=64: mount.xfs's sunit value is given in 512-byte block units (see man 5 xfs), while the xfs_info output reported here is in blocks with a block size (bsize) of 4096, so sunit = 512*512 = 64*4096. mkfs uses the minimum and optimal I/O sizes for stripe unit and stripe width; you can check this e.g. via (note that server2 with the fixed firmware version reports proper values, whereas server3 with the broken controller firmware reports nonsense):
synpromika@server2 ~ % for i in /sys/block/sd*/queue/ ; do printf "%s: %s %s\n" "$i" "$(cat "$i"/minimum_io_size)" "$(cat "$i"/optimal_io_size)" ; done
[...]
/sys/block/sdc/queue/: 262144 262144
/sys/block/sdd/queue/: 262144 262144
/sys/block/sde/queue/: 262144 262144
/sys/block/sdf/queue/: 262144 262144
/sys/block/sdg/queue/: 262144 262144
/sys/block/sdh/queue/: 262144 262144
/sys/block/sdi/queue/: 262144 262144
/sys/block/sdj/queue/: 262144 262144
/sys/block/sdk/queue/: 262144 262144
/sys/block/sdl/queue/: 262144 262144
/sys/block/sdm/queue/: 262144 262144
/sys/block/sdn/queue/: 262144 262144
[...]
synpromika@server3 ~ % for i in /sys/block/sd*/queue/ ; do printf "%s: %s %s\n" "$i" "$(cat "$i"/minimum_io_size)" "$(cat "$i"/optimal_io_size)" ; done
[...]
/sys/block/sdc/queue/: 524288 262144
/sys/block/sdd/queue/: 524288 262144
/sys/block/sde/queue/: 524288 262144
/sys/block/sdf/queue/: 524288 262144
/sys/block/sdg/queue/: 524288 262144
/sys/block/sdh/queue/: 524288 262144
/sys/block/sdi/queue/: 524288 262144
/sys/block/sdj/queue/: 524288 262144
/sys/block/sdk/queue/: 524288 262144
/sys/block/sdl/queue/: 524288 262144
/sys/block/sdm/queue/: 524288 262144
/sys/block/sdn/queue/: 524288 262144
[...]
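The listings above can be condensed into a check that only prints the suspicious devices. This is a minimal sketch; the function name and the optional base-directory argument (handy for trying it against a copy of the sysfs files) are assumptions for illustration.

```shell
# Print block devices whose minimum_io_size exceeds their optimal_io_size.
# Usage: find_bad_io_hints [BASE_DIR]   (BASE_DIR defaults to /sys/block)
find_bad_io_hints() {
    base=${1:-/sys/block}
    for q in "$base"/*/queue; do
        [ -r "$q/minimum_io_size" ] && [ -r "$q/optimal_io_size" ] || continue
        min=$(cat "$q/minimum_io_size")
        opt=$(cat "$q/optimal_io_size")
        # an optimal_io_size of 0 means the device reports no hint; skip it
        if [ "$opt" -gt 0 ] && [ "$min" -gt "$opt" ]; then
            echo "${q%/queue}: min=$min opt=$opt"
        fi
    done
}
```

On a host like server3 this would list every disk behind the affected controller, while on server2 it would print nothing.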
This is the underlying reason why the initially created XFS partitions were created with incorrect sunit/swidth settings. The broken firmware of server1 and server3 was the cause of the incorrect settings; they were ignored by old(er) xfs/kernel versions, but treated as an error by new ones. Make sure to also read the XFS FAQ regarding "How to calculate the correct sunit,swidth values for optimal performance". We also stumbled upon two interesting reads in Red Hat's knowledge base: 5075561 + 2150101 (requires an active subscription, though) and #1835947.

Am I affected? How to work around it?

To check whether your XFS mount points are affected by this issue, the following command line should be useful:
awk '$3 == "xfs"{print $2}' /proc/self/mounts | while read mount ; do echo -n "$mount " ; xfs_info $mount | awk '$0 ~ "swidth"{gsub(/.*=/,"",$2); gsub(/.*=/,"",$3); print $2,$3}' | awk '{if ($1 > $2) print "impacted"; else print "OK"}' ; done
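For readers who prefer something easier to adapt, the parsing step of that one-liner can be pulled out into a function that reads xfs_info output on stdin. This is a sketch under the assumption that the data section contains a line with both sunit= and swidth=, as in the outputs shown in this post; the function name is made up.

```shell
# Read xfs_info output on stdin and classify the data stripe geometry.
# Prints "impacted" if sunit > swidth, "OK" otherwise, and "unknown" if
# no line containing swidth= was found.
check_xfs_info() {
    awk '/swidth=/ {
        for (i = 1; i <= NF; i++) {
            split($i, kv, "=")
            if (kv[1] == "sunit")  su = kv[2] + 0
            if (kv[1] == "swidth") sw = kv[2] + 0
        }
        seen = 1
    }
    END {
        if (!seen)        print "unknown"
        else if (su > sw) print "impacted"
        else              print "OK"
    }'
}
```

It would be used as `xfs_info /some/mountpoint | check_xfs_info`.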
If you run into the above situation, the only known solution to get your original XFS partition working again is to boot into an older kernel version (4.17 or older), mount the XFS partition with correct sunit/swidth settings, and then boot back into your new system (kernel-version-wise).

Lessons learned

Thanks: Darshaka Pathirana, Chris Hofstaedtler and Michael Hanscho. Looking for help with your IT infrastructure? Let us know!

31 March 2021

Timo Jyrinki: MotionPhoto / MicroVideo File Formats on Pixel Phones

Google Pixel phones support what they call Motion Photo, which is essentially a photo with a short video clip attached to it. They are quite nice since they bring the moment alive, especially as the capturing of the video starts a short moment before the shutter button is pressed. Most viewing programs simply show them as static JPEG photos, but there is more to the files.
I'd really love proper Shotwell support for these file formats, so I posted a longish explanation with many of the details in this blog post to a ticket there too. Examples of the newer format are linked there too.
Info posted to Shotwell ticket

There are actually two different formats: an old one that is already obsolete, and a newer current format. The older ones are those that your Pixel phone recorded as MVIMG_[datetime].jpg, and they have the following metadata:
Xmp.GCamera.MicroVideo                       XmpText     1  1
Xmp.GCamera.MicroVideoVersion XmpText 1 1
Xmp.GCamera.MicroVideoOffset XmpText 7 4022143
Xmp.GCamera.MicroVideoPresentationTimestampUs XmpText 7 1331607
The offset is actually counted from the end of the file, so one needs to calculate accordingly. But it is exact otherwise, so one can simply extract the video with that metadata information:
#!/bin/bash
#
# Extracts the microvideo from a MVIMG_*.jpg file

# The offset is from the ending of the file, so calculate accordingly
offset=$(exiv2 -p X "$1" | grep MicroVideoOffset | sed 's/.*\"\(.*\)"/\1/')
filesize=$(du --apparent-size --block=1 "$1" | sed 's/^\([0-9]*\).*/\1/')
extractposition=$(expr $filesize - $offset)
echo offset: $offset
echo filesize: $filesize
echo extractposition=$extractposition
dd if="$1" skip=1 bs=$extractposition of="$(basename -s .jpg $1).mp4"
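Since the offset counts from the end of the file, the same extraction can also be sketched with tail -c, which avoids computing the start position first. The function name and argument order are just for illustration.

```shell
# Copy the last OFFSET bytes of FILE (the embedded video) to OUT.
# Usage: extract_tail FILE OFFSET OUT
extract_tail() {
    tail -c "$2" "$1" > "$3"
}
```

For the example metadata above this would be `extract_tail MVIMG_x.jpg 4022143 MVIMG_x.mp4`, with MVIMG_x.jpg standing in for an actual file name.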
The newer format is recorded in files named PXL_[datetime].MP.jpg, and they have a _lot_ of additional metadata:
Xmp.GCamera.MotionPhoto                      XmpText     1  1
Xmp.GCamera.MotionPhotoVersion XmpText 1 1
Xmp.GCamera.MotionPhotoPresentationTimestampUs XmpText 6 233320
Xmp.xmpNote.HasExtendedXMP XmpText 32 E1F7505D2DD64EA6948D2047449F0FFA
Xmp.Container.Directory XmpText 0 type="Seq"
Xmp.Container.Directory[1] XmpText 0 type="Struct"
Xmp.Container.Directory[1]/Container:Item XmpText 0 type="Struct"
Xmp.Container.Directory[1]/Container:Item/Item:Mime XmpText 10 image/jpeg
Xmp.Container.Directory[1]/Container:Item/Item:Semantic XmpText 7 Primary
Xmp.Container.Directory[1]/Container:Item/Item:Length XmpText 1 0
Xmp.Container.Directory[1]/Container:Item/Item:Padding XmpText 1 0
Xmp.Container.Directory[2] XmpText 0 type="Struct"
Xmp.Container.Directory[2]/Container:Item XmpText 0 type="Struct"
Xmp.Container.Directory[2]/Container:Item/Item:Mime XmpText 9 video/mp4
Xmp.Container.Directory[2]/Container:Item/Item:Semantic XmpText 11 MotionPhoto
Xmp.Container.Directory[2]/Container:Item/Item:Length XmpText 7 1679555
Xmp.Container.Directory[2]/Container:Item/Item:Padding XmpText 1 0
Sounds like fun and lots of information. However, I didn't see why the length in the first item is 0, and I didn't see how to use the latter Length info. But I can use the mp4 headers to extract it:
#!/bin/bash
#
# Extracts the motion part of a MotionPhoto file PXL_*.MP.jpg

extractposition=$(grep --binary --byte-offset --only-matching --text \
-P "\x00\x00\x00\x18\x66\x74\x79\x70\x6d\x70\x34\x32" $1 | sed 's/^\([0-9]*\).*/\1/')

dd if="$1" skip=1 bs=$extractposition of="$(basename -s .jpg $1).mp4"
UPDATE: I wrote most of this blog post earlier. When now actually getting to publishing it a week later, I see the obvious, i.e. the Length is again simply the offset from the end of the file, so one could use the same less brute-force approach as for MVIMG. I'll leave the above as is, however, for the love of binary grepping. (cross-posted to my other blog)
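Following that observation, the Length could be picked out of the metadata listing shown earlier. The sketch below assumes input in the "key type count value" form of that listing and selects the Container entry whose Item:Mime is video/mp4; the function name is made up, and the exact exiv2 invocation that produces this form is not confirmed here.

```shell
# Read a metadata listing on stdin (one "key type count value" per line) and
# print the Item:Length of the Container entry whose Item:Mime is video/mp4.
motionphoto_length() {
    awk '
        # key looks like: Xmp.Container.Directory[2]/Container:Item/Item:Mime
        match($1, /Directory\[[0-9]+\]/) {
            idx = substr($1, RSTART, RLENGTH)
            if ($1 ~ /Item:Mime$/ && $NF == "video/mp4") video = idx
            if ($1 ~ /Item:Length$/) len[idx] = $NF
        }
        END { if (video != "") print len[video] }
    '
}
```

The printed value could then be passed to `tail -c` to copy that many bytes from the end of the file.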

7 March 2021

Louis-Philippe Véronneau: New Year, New OpenPGP Key

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA512
Sun, 07 Mar 2021 13:00:17 -0500
I've recently set up a new OpenPGP key and will be transitioning away from my
old one.
It is a chance for me to start using a OpenPGP hardware token and to transition
to a new personal email address (my main public contact is still my
 @debian.org  address).
Please note that I've partially redacted some email addresses from this
statement to minimise the amount of spam I receive. It shouldn't be hard for
actual humans to follow the instructions below to find the complete addresses.
The old key will continue to be valid for a few months, but will eventually be
revoked.
You might know my old OpenPGP certificate as:
pub   rsa4096/0x7AEAC4EC6AAA0A97 2014-12-22 [expires: 2021-06-02]
      Key fingerprint = 677F 54F1 FA86 81AD 8EC0  BCE6 7AEA C4EC 6AAA 0A97
uid       Louis-Philippe Véronneau <REDACTED@riseup.net>
uid       Louis-Philippe Véronneau (alias) <REDACTED@riseup.net>
uid       Louis-Philippe Véronneau (debian) <REDACTED@debian.org>
My new OpenPGP certificate is:
pub   ed25519/0xE1E5457C8BAD4113 2021-03-06 [expires: 2022-03-06]
      Key fingerprint = F64D 61D3 21F3 CB48 9156  753D E1E5 457C 8BAD 4113
uid       Louis-Philippe Véronneau <REDACTED@veronneau.org>
uid       Louis-Philippe Véronneau <REDACTED@debian.org>
These days, I mostly use my key for Debian and to sign git commit. I don't
really expect you to sign my new key if you had signed my old one.
I've published the new certificate on keys.openpgp.org as well as on my
personal website. You can fetch it like this:
    $ wget -O- https://veronneau.org/media/openpgp.key | gpg --import
-----BEGIN PGP SIGNATURE-----
iQIzBAEBCgAdFiEEZ39U8fqGga2OwLzmeurE7GqqCpcFAmBFFM8ACgkQeurE7Gqq
CpcuchAAscAeszdtA+TlCI4YvK5nlk+nJnCnNBSnl7Et+jiNjq8kB/Fud+dWMTXC
Zag8oJkalbbxub0BT0bEAn+BiBunu58E0gd0Xq4syTbqZ5o5IN17S/tfxCD0k1hf
ewrnYZ2l0i5g4YvHGKC+Xv4D+Z84BylnIRaXHqlUdluOVfVYDfLybOAqoktO/KUH
I+vQBwXj0Fr/QAtgiz5Nwh/YHFiU9xMSvr5ozRwAFs6+xfIqFHuVPRRkEN5iVo4D
kkMIz+kFfkoh4aWIP4dgAu39XnEgxwTR9J+4yE8TzCCMzO7xCK0X6vqgPAxYMPvb
RuP4FnGWOnGnlcudCUAUkOaryrwRi+dPQTnNICHTYsvVc7dg+W0EhVUkwEuuEwpI
qtcB/Y5AGhqK0Cc11uXiFjIQwLTgwcUez4F0xrGeqsTtAM5gyRup2w0jbocTuYSh
ZRv/2zwrq/S3xVrUYGqdT+L5odmkBzz9zOwY5WlU2H9CMFOdh71XOv9wWQXan9ou
hLRodeOQ8MinIBP+sX36ol1zg/aP7mCHvRRSBzWt7l3WhVxgZFpNwIfp/RZqU0R4
IEq48mntFhPvHJjFmAKLKK/ckzNMtSn+HWQPJV3HTInKCTu5PTNMU3SAvPHOHEps
V6WWSOPB+1Lm/tlIULDc+0SopWoiWO4NObCSs8zMZHlYPBk5x/KIdQQBFgoAHRYh
BMqnQAcHqBawIC/DzfQlelCyHPqFBQJgRRTPAAoJEPQlelCyHPqFFVEA/1qScaAk
O+eBEE4q0BaJDsqweCS1XCcuQGkQCKi5Zv6kAQChQ96Ve7cKbN/wRkT9pdIhmx01
+CmIsnp3k6N0ZYLLCg==
=onl0
-----END PGP SIGNATURE-----
